Reader Aids —

Purpose: Present a general model 

Special math needed for explanations: Statistics and linear algebra 

Special math needed to use the results: Statistics 

Results useful to: Software reliability theorists and analysts 

Abstract — Miller & Sofer previously presented a new nonparametric
method for estimating the failure rate of a software program. The
method is based on the complete monotonicity property of the failure
rate, and uses regression to estimate the current software-failure
rate. This paper extends this completely monotone model and
demonstrates how it can also provide longer-range predictions of
reliability growth. Preliminary evaluation indicates that the method
is competitive with parametric approaches, while being more robust.


1. INTRODUCTION 

Suppose a program is executed for a length of time $T$. During this
time, $n$ bugs are detected and removed when they manifest themselves
as failures. The successive failures occur at times:

$0 < t_1 < t_2 < \cdots < t_n \le T.$ (1)

When bugs are corrected without introducing new faults, the 
program evolves into a more reliable program, hence the term 
“reliability growth”. Given the past software data (1), we want
to make various statistical inferences concerning the current and 
future reliability of the software. In particular we are interested 
in — 

• the number of failures anticipated over some future horizon 

• the future failure rate after an additional specified time of 
debugging. 

Over the years, many competing models for software- 
reliability growth have been developed, eg, Duane [6], Jelin- 
ski & Moranda [8], Goel & Okumoto [7], Littlewood [9], and 
Musa & Okumoto [16]. These are all parametric models, and 
have a common property: complete monotonicity of the failure 
rate. 


Notation 

$N(t)$   (random) number of failures observed in $[0,t]$
$M(t)$   $E\{N(t)\}$, mean number of failures, viz, the mean function
$r(t)$   $dM(t)/dt$, $0 \le t$, intensity function of the point process
         $\{N(t), 0 \le t\}$

The $r(t)$ is also referred to as the failure rate of the process,
although failure intensity is probably a better name. A function
$r(\cdot)$ is completely monotone if and only if it has derivatives
of all orders, and they alternate in sign:

$(-1)^q \, \dfrac{d^q r(t)}{dt^q} \ge 0, \quad t \ge 0, \quad q = 0, 1, 2, \ldots$ (2)

Miller [12] has shown that software under random time-homogeneous
testing or usage with perfect fixes shows completely monotone
reliability-growth, and conversely that virtually all completely
monotone functions can occur as intensity functions of
reliability-growth point processes. Thus, a general approach to
software-reliability growth modelling should include the entire class
of completely monotone intensities. Reliability-growth prediction
based on a single parametric family of reliability-growth processes
cannot be justified.

Miller & Sofer [13] previously introduced a nonparametric model for
software-reliability growth which is based on complete monotonicity
of the failure rate. The method uses regression to estimate the
current software failure rate. Miller & Sofer
[14] show that this method often gives estimates which have 
a lower r-bias than those of certain (widely-used) parametric 
methods; using Monte Carlo simulated failure data, these ‘‘com- 
pletely monotone regression” estimates of current failure rate 
are also shown to be more robust than the estimates based on 
parametric models. 

Chan [4] has estimated the distribution of time-until-next-failure
for real data using completely monotone regression
estimates of current reliability. He starts with a raw estimate 
which is an exponential distribution with the estimated current 
failure rate and then “adapts” it to a more general distribution 
using the procedure of Littlewood & Keiller [10]. Chan then 
evaluates these estimates using criteria of Abdel-Ghaly, Chan, 
Littlewood [1]. The Chan study shows that completely monotone 
regression gives good estimates that are more robust than 
estimates from parametric models. 

This paper extends the completely monotone software 
model by developing a method for providing long-range predic- 
tions of reliability growth, based on the model. The paper 
derives upper and lower bounds on extrapolations of the failure 
rate and the mean function. These are then used to obtain 
estimates for the future software failure rate and the mean future 
number of failures. 


0018-9529/91/0800-0329$01.00 © 1991 IEEE

IEEE TRANSACTIONS ON RELIABILITY, VOL. 40, NO. 3, 1991 AUGUST


2. NOTATION, DEFINITIONS, ASSUMPTIONS

Notation

$\alpha$             second order difference defined in (17)
$d$                  order of the highest difference constraint (7)
$D(x,y)$             weighted squared distance between vectors $x$ and $y$
$i, j$               dummy indices
$k$                  number of discrete time subintervals over the
                     observation interval $[0,T]$
$l$                  number of subintervals over the future prediction
                     horizon
$M(t)$               mean number of failures observed in the interval $[0,t]$
$\hat M(t)$          raw piecewise linear estimate of $M(t)$
$\hat m_i$           $\hat M(s_i)$
$m_i$                smoothed least squares estimate of $M(s_i)$
$n$                  number of observed failures over the interval $[0,T]$
$N(t)$               random number of failures occurring in the interval
                     $[0,t]$
$\{N(t), 0 \le t\}$  stochastic point process of the number of failures
$p$                  defined in (9) in terms of the failure rate, and in
                     (24) in terms of the mean function
$P(j)$               an instance of a third order extrapolation of the
                     failure rate, for $j$ subintervals into the future
$q$                  defined in (12) in terms of the failure rate, and in
                     (27) in terms of the mean function
$r(t)$               $dM(t)/dt$, failure rate at time $t$
$\hat r_i$           raw estimate of the failure rate over the subinterval $i$
$r_i$                smoothed estimate (least squares estimate satisfying
                     completely monotone constraints) of the failure rate
                     over subinterval $i$
$s_i$                $i\theta$, the end point of subinterval $i$
$t$                  time
$T$                  length of the observation interval
$t_i$                occurrence time of failure $i$
$u$                  defined in (15) in terms of the failure rate, and in
                     (30) in terms of the mean function
$v$                  defined in (22)
$w_i$                weight assigned to subinterval $i$ in the least
                     squares equation
$\delta$             adjustment to the number of failures in subinterval
                     $k$, see (8)
$\Delta^j$           order $j$ backward difference operator, see (5)
$\theta$             $T/k$, length of each of the $k$ subintervals of
                     $[0,T]$; also the length of the $l$ subintervals in
                     the future prediction horizon

Other, standard notation is given in “Information for Readers
& Authors” at the rear of each issue.

Definition 

A positive function r ( • ) is completely monotone if and only 
if it has derivatives of all orders, and they alternate in sign, see (2). 

Assumptions 

1. Parametric models are an approximation to the software-reliability
growth process. In general, there is no “correct” parametric
reliability-growth model. While a parametric model might work well on
some failure-data sets, it might also give bad predictions for other
data sets [1].

2. The “goodness of fit” to observed data and “quality of prediction”
of future failure behavior are two distinct (not necessarily
equivalent) properties of reliability growth models [1].

Background Theory

1. Under very general conditions, if the software usage is random and
time homogeneous, and if faults are fixed immediately and perfectly,
then the reliability growth process has a completely monotone
intensity [12].

2. Conversely, virtually all completely monotone functions can occur
as the failure rate of reliability-growth processes [12].


3. PROBLEM FORMULATION

Consider the failure data as in (1). Our goal is to find a completely
monotone rate function and/or the associated mean function which best
fits the data. The mean function does not strictly satisfy the
complete monotonicity property; rather, $M(t)$ is a nonnegative
function whose derivative $dM(t)/dt$ is a completely monotone
function. Our approach is to obtain an initial raw estimate for the
required function from the data, and then to smooth it by fitting a
completely monotone function which is closest to it in the least
squares sense.

A reasonable raw estimate $\hat M(t)$ for the mean function is a
piecewise linear function with breakpoints at $t_i$, $i = 1, \ldots, n$,
such that $\hat M(t_i) = i$:

$\hat M(t) = \begin{cases} i + \dfrac{t - t_i}{t_{i+1} - t_i}, & t_i \le t \le t_{i+1};\ i = 0, \ldots, n-1 \\ n + \delta \, \dfrac{t - t_n}{T - t_n}, & t_n \le t \le T \end{cases}$ (3)

where $t_0 = 0$. The second term in the final interval reflects the
absence of a failure in the period $(t_n, T]$. The choice of $\delta$
is somewhat arbitrary, with higher values tending to give more
conservative estimates. In this work we consider values of 0.0, 0.5,
1.0 for $\delta$; however one can argue for and against any
particular value.

In practice, it is necessary to discretize the problem of finding a
completely monotone function to the mathematically more tractable
problem of finding a finite set of points along that function. The
most plausible and straightforward approach is to consider discrete
time points which are equally spaced. We thus divide the time
interval $[0,T]$ into $k$ intervals of equal length $\theta = T/k$,
and define $s_i = i\theta$, $i = 0, \ldots, k$. Thus the sequence
$\hat m_i = \hat M(s_i)$, $i = 1, \ldots, k$, is an initial estimator
for the values of the mean function at the fixed points $s_i$. In
general, however, this sequence does not satisfy the complete
monotonicity assumptions of the model, and thus needs some
modification.

For the problem of estimating the rate function, we obtain an initial
estimator from the slope of $\hat M(t)$. Specifically, with
$\hat m_i = \hat M(s_i)$ and $\theta = T/k$, the sequence

$\hat r_i = (\hat m_i - \hat m_{i-1}) / \theta, \quad i = 1, \ldots, k$

is a raw estimate of the failure rate at the points $s_i$.
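
The raw estimates described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `raw_mean` and `raw_rates` are hypothetical, the first segment is assumed to start at $t_0 = 0$, and `delta` is the adjustment applied on the final interval $(t_n, T]$.

```python
from bisect import bisect_right

def raw_mean(t, times, T, delta=0.5):
    """Raw piecewise linear estimate of M(t), following (3).

    times: sorted failure times t_1 < ... < t_n (assumption: t_0 = 0).
    delta: adjustment for the failure-free final interval (t_n, T].
    """
    n = len(times)
    i = bisect_right(times, t)            # number of failures at or before t
    if i == n:                            # final interval (t_n, T]
        t_n = times[-1] if n else 0.0
        if T == t_n:                      # testing stopped at a failure
            return float(n)
        return n + delta * (t - t_n) / (T - t_n)
    left = times[i - 1] if i > 0 else 0.0
    return i + (t - left) / (times[i] - left)

def raw_rates(times, T, k, delta=0.5):
    """Raw failure rates: slopes of the raw mean over k equal subintervals."""
    width = T / k                         # subinterval length T/k
    m = [raw_mean(i * width, times, T, delta) for i in range(k + 1)]
    return [(m[i] - m[i - 1]) / width for i in range(1, k + 1)]
```

For example, with failures at times 1 and 3, $T = 4$, and $k = 4$, the raw mean at $T$ is $n + \delta = 2.5$ and the four raw rates sum (times the subinterval length) to the same value.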

When working with discrete, equally spaced time points, the analogue
of a completely monotone function is a completely monotone sequence.
The sequence $(r_i,\ i = 1, 2, \ldots)$ is completely monotone if

$(-1)^j \Delta^j r_i \ge 0, \quad j + 1 \le i; \quad j = 0, 1, \ldots$ (4)

where $\Delta^j$ is the order $j$ backward difference operator, defined in (5).


Return to the problem of estimating the mean function. Its first
order derivative is completely monotone, and using the above, our
problem is:

$\min D(m, \hat m) = \sum_{i=1}^{k} (m_i - \hat m_i)^2$

subject to

$(-1)^{d+1} \Delta^d m_i \ge 0, \quad d \le i \le k + l$ (8)

$(-1)^{j+1} \Delta^j m_{k+l} \ge 0, \quad 0 < j \le d - 1$

^ 5/1 + 5, k > 0 

^, = 0 . 


$\Delta^0 r_i = r_i, \quad \Delta^j r_i = \Delta^{j-1} r_i - \Delta^{j-1} r_{i-1}, \quad j \ge 1$ (5)
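
As an illustration of the backward difference operator and the sign conditions in (4), the following sketch checks whether a finite sequence is completely monotone up to a chosen order $d$. The function names and the tolerance `tol` are assumptions made for this example only.

```python
def backward_diff(seq, j):
    """Order-j backward differences of seq.

    Entry h of the result is Delta^j applied at position h + j of seq
    (0-based), so the result has len(seq) - j entries.
    """
    out = list(seq)
    for _ in range(j):
        out = [out[i] - out[i - 1] for i in range(1, len(out))]
    return out

def is_completely_monotone(seq, d, tol=1e-12):
    """True if (-1)^j * Delta^j seq >= 0 for every order j = 0, ..., d."""
    for j in range(d + 1):
        sign = (-1) ** j
        if any(sign * x < -tol for x in backward_diff(seq, j)):
            return False
    return True
```

A geometric sequence such as 8, 4, 2, 1, 0.5 passes the check for every order that its length allows, whereas 3, 2, 2, 1 is nonincreasing (order 1) but not convex (order 2).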


In general, the initial estimate $(\hat r_1, \ldots, \hat r_k)$ does
not have the complete monotonicity property. Our goal is to find the
“closest” completely monotone sequence $(r_1, \ldots, r_k)$ and use
it as an estimate of the sequence of failure rates at times $s_i$.
Using the criterion of weighted least squares, the problem is to find
a vector $r$ which minimizes

$D(r, \hat r) = \sum_{i=1}^{k} w_i (r_i - \hat r_i)^2$ (6)

subject to the complete monotonicity constraints of (4), where
$w_i$ is a set of prespecified weights.

Numerical experience indicates that the effect of the very high order
difference constraints on the optimal solution is at most marginal;
moreover, their presence leads to ill-conditioning of the
optimization problem. Consequently we relax the constraints in (4)
and consider differences of order at most $d$ (not $\infty$), with
$d$ being typically 3 or 4. Similarly, it would be necessary to
constrain the sequence infinitely far into the future; we restrict
the number of future intervals to $l$, rather than $\infty$. Finally,
many of the constraints in (4) are redundant, eg,
$\Delta^1 r_i \le 0$ and $\Delta^2 r_i \ge 0$ imply that
$\Delta^1 r_{i-1} \le 0$. Eliminating those redundant constraints, we
obtain the reduced system of constraints:

$(-1)^d \Delta^d r_i \ge 0, \quad d + 1 \le i \le k + l$
                                                                  (7)
$(-1)^j \Delta^j r_{k+l} \ge 0, \quad 0 \le j \le d - 1,$

and our problem is to minimize (6) subject to (7).

For $d = 1$, the problem is the well known “isotone regression”
(Barlow, et al [2]) and addressed in the reliability-growth context
by Campbell & Ott [3], and Nagel, et al [17]. If the last
interfailure time happens to come from the right tail of the
interfailure time distribution, $\hat r_k$ underestimates $r(T)$, and
the monotone constraint on $r$ has no effect; thereby leading to a
negative bias. Imposing the additional constraint of convexity tends
to pull this estimate up. In most software-reliability applications,
a positively biased estimate of the failure rate is safer than a
negatively biased estimate; thus, higher order constraints seem to be
desirable, and the generalization of isotone regression to completely
monotone regression is an improvement.
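
For the special case $d = 1$ only, the weighted least squares fit under a nonincreasing constraint can be computed with the standard pool-adjacent-violators algorithm. The sketch below is that textbook algorithm, not the solver used for the higher order constraints in this paper; the function name is hypothetical and the weights are assumed to be strictly positive. Nonnegativity holds automatically when the raw rates are nonnegative, because every fitted value is a weighted average of raw values.

```python
def isotone_nonincreasing(raw, weights=None):
    """Weighted least squares fit of a nonincreasing sequence (d = 1 case).

    Pool-adjacent-violators: whenever a later block mean exceeds the
    preceding block mean, merge the two blocks into their weighted mean.
    """
    w = [1.0] * len(raw) if weights is None else list(weights)
    blocks = []                           # each block: [mean, weight, count]
    for x, wi in zip(raw, w):
        blocks.append([float(x), wi, 1])
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit
```

Applied to the raw rates 3, 1, 2, 0 with equal weights, the violating pair (1, 2) is pooled to 1.5, giving 3, 1.5, 1.5, 0. The last fitted value is unchanged whenever the last raw value is already the smallest, which is the situation in which the text notes that the monotone constraint has no effect.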


If testing stopped at a failure ($t_n = T$), then $\delta = 0$. For
truncated testing however, $\delta = 0.5$ is more plausible. Using an
argument based on the assumption of a Poisson process, $\delta = 1$
is also a plausible choice.

The optimization problems in this section are linearly con- 
strained quadratic programming problems, and algorithms for 
their solution are readily available in the literature. However, 
our particular problem of least squares regression under higher 
order difference constraints becomes increasingly ill-conditioned 
as the problem size grows [15]. Thus, a numerically stable 
algorithm should be employed for its solution. For a detailed 
description of a viable solution approach, see [15]. 

An additional difficulty when attempting to include monotonicity
requirements into the future is that the Hessian matrix (the matrix
of the second order derivatives of the objective function) is
singular, since the predictions $r_i$ and $m_i$ (where
$i = k+1, \ldots, k+l$) do not appear in the objective. Moreover, the
optimal future rate or mean estimators obtained by the least squares
objective are not unique. Section 4 shows how to overcome the problem
of singularity, by reformulating the constraints on the future rates
(or mean function estimates) in terms of those of the past.
Surprisingly, this approach also provides bounds: lower and upper
envelopes for these future estimates.


4. PREDICTIONS 

Formulation (8) gives rise to some computational problems when
predictions are requested, ie, when $l > 0$. Algorithms for solving
quadratic programming problems [11] usually require that the Hessian
matrix of the objective function be positive-definite. However, the
Hessian matrix of the objective for (6) (ie,
$\mathrm{diag}(w_1, \ldots, w_k)$) is only positive semi-definite,
and does have singularities. As a result, not only do we encounter
numerical difficulties when trying to solve the problem directly, but
the optimal solution is not unique. Indeed, any two solution vectors
where the first $k$ components are equal yield exactly the same
objective value. In other words, if the completely monotone sequence
$(r_1, \ldots, r_k)$ can be extrapolated $l$ time intervals into the
future, in a way that the resulting sequence
$(r_1, \ldots, r_{k+l})$ is completely monotone, then all such
possible extrapolations have the same least squares objective. We
show that among all such extrapolations, there exist a globally
highest and a globally lowest extrapolation, and all other completely
monotone extrapolations into the future must lie in between the
highest and lowest bounds. We thus have an envelope in which all
completely monotone extrapolations are restricted. In addition, we
derive the conditions under which the sequence $(r_1, \ldots, r_k)$
can be extrapolated as a completely monotone sequence into the
future.

Consider the completely monotone sequence of order $d$:
$R = (r_1, \ldots, r_k)$. The sequence $(r_{k+1}, \ldots, r_{k+l})$
is defined to be a feasible completely monotone extrapolation of
order $d$ for $R$, if the sequence $(r_1, \ldots, r_{k+l})$ is
completely monotone up to order $d$, ie, it satisfies (7). In
addition, this extrapolation constitutes an upper bound for all
feasible extrapolations of order $d$, if any other such
extrapolation, $(r'_{k+1}, \ldots, r'_{k+l})$, satisfies
$r'_{k+i} \le r_{k+i}$ for $i = 1, \ldots, l$. Similarly it
constitutes a lower envelope if $r'_{k+i} \ge r_{k+i}$ for all $i$.
We derive conditions for the existence of such higher and lower
envelopes for the completely monotone extrapolations.

For $d = 1$ and $d = 2$, the sequence $(r_1, \ldots, r_k)$ can be
extrapolated into the future by letting $r_{k+i} = r_k$,
$i = 1, \ldots, l$. This extrapolation is clearly the upper envelope
for all completely monotone extrapolations of order 1 and 2, and is
always feasible. Also for $d = 1$, the extrapolation $r_{k+i} = 0$ is
clearly a lower envelope for all isotone extrapolations. The next
proposition shows that the lower envelope for feasible extrapolations
of order $d = 2$ is along a piecewise linear function which has slope
$\Delta^1 r_k$ until it reaches zero, after which it continues as a
constant function zero. We define

$p = \begin{cases} \mathrm{gilb}(-r_k / \Delta^1 r_k), & \text{if } \Delta^1 r_k < 0 \\ l, & \text{if } \Delta^1 r_k = 0 \end{cases}$ (9)


Proposition 1. Consider the constraints (7) with $d = 2$ and fixed
$l > 0$, and let $(r_1, \ldots, r_k)$ be a feasible solution to (7)
with $l = 0$. Then the extrapolation

$r_{k+i} = \begin{cases} r_k + i \Delta^1 r_k, & i = 1, \ldots, p \\ 0, & i = p+1, \ldots, l \end{cases}$ (10)

is a lower envelope for all feasible extrapolations of order 2 to
$(r_1, \ldots, r_k)$.

Proof. The solution above is clearly monotone, and
$\Delta^2 r_{k+i} = 0$ for $i = 1, \ldots, p$ and
$i = p+3, \ldots, l$. In addition,
$\Delta^2 r_{k+p+1} = -(r_k + (p+1) \Delta^1 r_k)$ and
$\Delta^2 r_{k+p+2} = r_k + p \Delta^1 r_k$, which, by definition of
$p$, are both nonnegative. Thus the constraints of (7) for $d = 2$
are satisfied. Note also that for any other feasible extrapolation
$(r'_{k+1}, \ldots, r'_{k+l})$ we have

$\Delta^1 r'_{k+j} \ge \Delta^1 r_k.$

Thus, if $i \le p$ then

$r'_{k+i} = r_k + \sum_{j=1}^{i} \Delta^1 r'_{k+j} \ge r_k + i \Delta^1 r_k = r_{k+i}.$

It follows that (10) is a lower envelope as proposed. Q.E.D.

Proposition 2. Consider the constraints (7) with $d = 3$ and fixed
$l > 0$. A solution $(r_1, \ldots, r_k)$ which satisfies (7) with
$l = 0$ can be extrapolated to a vector $(r_1, \ldots, r_{k+l})$
which satisfies (7) with $l > 0$ if and only if:

$r_k + j \Delta^1 r_k + \tfrac{1}{2} j (j+1) \Delta^2 r_k \ge 0, \quad j = 1, \ldots, l.$ (11)

In addition, let

$q = \begin{cases} \mathrm{gilb}(-\Delta^1 r_k / \Delta^2 r_k), & \text{if } \Delta^2 r_k > 0 \\ l, & \text{if } \Delta^2 r_k = 0. \end{cases}$

Then the upper envelope of all feasible extrapolations for $d = 3$
is:

$r_{k+i} = \begin{cases} r_k + i \Delta^1 r_k + \tfrac{1}{2} i (i+1) \Delta^2 r_k, & i = 1, \ldots, q \\ r_{k+q}, & i = q+1, \ldots, l. \end{cases}$

Proof: Any feasible extrapolation $(r'_{k+1}, \ldots, r'_{k+l})$
satisfies:

$r'_{k+i} = r_k + \sum_{j=1}^{i} \Delta^1 r'_{k+j}$

$\quad = r_k + \sum_{j=1}^{i} \left( \Delta^1 r_k + \sum_{h=1}^{j} \Delta^2 r'_{k+h} \right)$

$\quad = r_k + i \Delta^1 r_k + \sum_{j=1}^{i} \sum_{h=1}^{j} \Delta^2 r'_{k+h}$

$\quad \le r_k + i \Delta^1 r_k + \tfrac{1}{2} i (i+1) \Delta^2 r_k$

and the nonnegativity of $r'_{k+i}$ implies that (11) must hold.

Conversely, assume that (11) holds. Now since $\{r_i\}$ is completely
monotone of order 3, the sequence $\{-\Delta^1 r_i\}$ is completely
monotone with order $d = 2$. Using Proposition 1 for the lowest
feasible convex extrapolation for $\{-\Delta^1 r_i\}$, we obtain the
upper envelope for completely monotone extrapolations of
$(r_1, \ldots, r_k)$ of order 3. Q.E.D.

Proposition 3. Consider the constraints (7) with $d = 3$ and fixed
$l > 0$, and let $(r_1, \ldots, r_k)$ be a solution to (7) with
$l = 0$ satisfying (11). Let $p$ be defined as in (9).

(a). If $p \ge l$, then the extrapolation

$r_{k+i} = r_k + i \Delta^1 r_k, \quad i = 1, \ldots, l$

is a lower envelope for all feasible extrapolations of order 3 to
$(r_1, \ldots, r_k)$.

(b). If $p < l$, let

$u = \min(l, 1 + \mathrm{gilb}(-2 r_k / \Delta^1 r_k)).$

Then the extrapolation

$r_{k+i} = \begin{cases} r_k + i \Delta^1 r_k + \tfrac{1}{2} i (i+1) \left( \dfrac{-2 (r_k + u \Delta^1 r_k)}{u (u+1)} \right), & i = 1, \ldots, u \\ 0, & i = u+1, \ldots, l \end{cases}$ (12)

is a lower envelope for all feasible extrapolations of order 3 to
$(r_1, \ldots, r_k)$.

The proposition states that the lowest envelope is a linear function
with slope $\Delta^1 r_k$, provided that such a linear function is
feasible (nonnegative); otherwise it starts as a quadratic function
with constant second order difference

$\alpha = \dfrac{-2 (r_k + u \Delta^1 r_k)}{u (u+1)}$

which flattens to zero at $r_{k+u}$, and from there continues as
zero.

Proof. If $p \ge l$ then (10) is a feasible extrapolation of order 3,
thus (a) follows from Proposition 1. If $p < l$ then (10) does not
satisfy the third order difference constraints. We now show that for
this case, the function of (12) is a lower envelope for any feasible
extrapolation. First assume that there exists a feasible
extrapolation $r'_{k+1}, \ldots, r'_{k+l}$ for which
$\Delta^2 r'_{k+1} < \alpha$. Then

$r'_{k+1} + (u-1) \Delta^1 r'_{k+1} + \tfrac{1}{2} (u-1) u \Delta^2 r'_{k+1}$

$\quad = r_k + u \Delta^1 r_k + \tfrac{1}{2} u (u+1) \Delta^2 r'_{k+1}$

$\quad < r_k + u \Delta^1 r_k - (r_k + u \Delta^1 r_k) = 0,$

in contradiction to the conditions given by Proposition 2 for a
feasible extrapolation for $r_1, \ldots, r_k, r'_{k+1}$. We therefore
conclude that any feasible extrapolation has a second order
difference of at least $\alpha$. If, on the other hand,
$\Delta^2 r'_{k+1} \ge \alpha$, then $r'_{k+1} \ge r_{k+1}$. An
inductive argument starting from $r_{k+1}$ completes the proof.
Q.E.D.

Proposition 4. Consider the constraints (7) with $d = 4$ and fixed
$l > 0$. A solution $(r_1, \ldots, r_k)$ which satisfies (7) with
$l = 0$ can be extrapolated to a vector $(r_1, \ldots, r_{k+l})$
which satisfies (7) with $l > 0$ if and only if

$\Delta^1 r_k + j \Delta^2 r_k + \tfrac{1}{2} j (j+1) \Delta^3 r_k \le 0, \quad j = 1, \ldots, l$ (13)

$r_k + l \Delta^1 r_k + \tfrac{1}{2} l (l+1) \Delta^2 r_k \ge 0,$ (14)

$r_k + \tfrac{2}{3} (j-1) \Delta^1 r_k + \tfrac{1}{6} j (j-1) \Delta^2 r_k \ge 0, \quad j = 1, \ldots, l.$ (15)

If $\Delta^1 r_k + l \Delta^2 r_k \le 0$ then the upper envelope of
all such extrapolations is:

$r_{k+i} = r_k + i \Delta^1 r_k + \tfrac{1}{2} i (i+1) \Delta^2 r_k, \quad i = 1, \ldots, l.$ (16)

Otherwise, let

$v = \min(l, 1 + \mathrm{gilb}(-2 \Delta^1 r_k / \Delta^2 r_k)).$

Then the upper envelope of all such extrapolations is:

$r_{k+i} = \begin{cases} r_k + i \Delta^1 r_k + \tfrac{1}{2} i (i+1) \Delta^2 r_k + \tfrac{1}{6} i (i+1)(i+2) \left( \dfrac{-2 (\Delta^1 r_k + v \Delta^2 r_k)}{v (v+1)} \right), & i = 1, \ldots, v \\ r_{k+v}, & i = v+1, \ldots, l. \end{cases}$ (17)

Proof: If the sequence $\{r_{k+i}\}$ is a feasible extrapolation of
order $d = 4$ then the sequence $\{-\Delta^1 r_{k+i}\}$ is a feasible
extrapolation of order $d = 3$. By Proposition 2, the conditions for
existence of the latter are given by (13). In addition, the upper
envelope of all extrapolations for $d = 4$ is the sequence
$\{r_{k+i}\}$ for which $\{-\Delta^1 r_{k+i}\}$ constitutes the lower
envelope of all extrapolations of order $d = 3$. Applying
Proposition 3 with respect to the sequence $\{-\Delta^1 r_{k+i}\}$
and integrating over this lower envelope yields the sequence of (16)
and (17). Note that by construction, the resulting sequence is
nonincreasing, convex with nonpositive third order difference. It
remains to determine the conditions under which this sequence is
nonnegative. First, we note that condition (14) guarantees that (16)
will be nonnegative. From Proposition 2 this is also a necessary
condition. Also conditions (15) guarantee that $r_{k+v}$ is
nonnegative for any possible value of $v$ between 1 and $l$. Since
(17) represents a decreasing function which becomes constant for
$i \ge v$, this guarantees that $r_{k+i}$ is also nonnegative for any
$i$. To show that conditions (15) are also necessary, define

$P(j) = r_k + \tfrac{2}{3} (j-1) \Delta^1 r_k + \tfrac{1}{6} j (j-1) \Delta^2 r_k.$

It is easy to see that $P(j)$ decreases for $j = 1, \ldots, v$ and
increases for $j = v, \ldots, l$. Suppose that (15) is violated for
some $j$. Let $j'$ be the smallest index to violate this condition.
It follows that $j' \le v$ and that $P(v) \le 0$. This in turn
implies that $r_{k+v} \le 0$, and thus no extrapolation with $d = 4$
is feasible, hence a contradiction. Q.E.D.
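
The two simplest envelopes for the failure rate can be computed directly from the last smoothed value and its backward differences. The sketch below implements the lower envelope (10) for $d = 2$ and the upper envelope of Proposition 2 for $d = 3$, with `math.floor` standing in for gilb. The function names are hypothetical, `d1` and `d2` denote $\Delta^1 r_k$ and $\Delta^2 r_k$, and capping $q$ at $l$ is an assumption made so that the index stays within the horizon.

```python
import math

def lower_envelope_d2(r_k, d1, l):
    """Lower envelope (10): follow slope d1 until zero, then stay at zero."""
    p = l if d1 == 0 else math.floor(-r_k / d1)
    return [r_k + i * d1 if i <= p else 0.0 for i in range(1, l + 1)]

def upper_envelope_d3(r_k, d1, d2, l):
    """Upper envelope for d = 3: quadratic up to step q, constant afterwards."""
    q = l if d2 == 0 else min(l, math.floor(-d1 / d2))

    def quad(i):
        return r_k + i * d1 + 0.5 * i * (i + 1) * d2

    return [quad(min(i, q)) for i in range(1, l + 1)]
```

With $r_k = 2$, $\Delta^1 r_k = -0.5$, and $\Delta^2 r_k = 0.125$, the lower envelope for $d = 2$ reaches zero at step $p = 4$, while the upper envelope for $d = 3$ levels off at 1.25 from step $q = 4$ onward.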

We now derive the envelopes for prediction for the mean function.
Consider a sequence of order $d$: $M = (m_1, \ldots, m_k)$ which
satisfies (8). The sequence $(m_{k+1}, \ldots, m_{k+l})$ is defined
to be a feasible extrapolation of order $d$ for $M$, if the sequence
$(m_1, \ldots, m_{k+l})$ satisfies (8). In addition, this
extrapolation constitutes an upper bound for all feasible
extrapolations of order $d$, if any other such extrapolation
$(m'_{k+1}, \ldots, m'_{k+l})$ satisfies $m'_{k+i} \le m_{k+i}$ for
$i = 1, \ldots, l$. Similarly it constitutes a lower envelope if
$m'_{k+i} \ge m_{k+i}$ for $i = 1, \ldots, l$.

We derive conditions for the existence of such higher and lower
envelopes for the feasible extrapolations for $M$. The derivative of
the mean function is completely monotone. Therefore, the lower and
upper bounds for all feasible extrapolations of order $d$ to
$m_1, \ldots, m_k$ are obtained by integrating respectively over the
lower and upper bounds for all feasible extrapolations of order
$d - 1$ to $\Delta m_1, \ldots, \Delta m_k$.

Consequently, for the case $d = 1$, $d = 2$ and $d = 3$, the sequence
$(m_1, \ldots, m_k)$ can always be extrapolated into the future. The
upper envelope for all feasible extrapolations of order up to 3 is
the linear function:

$m_{k+i} = i \Delta^1 m_k + m_k.$

For $d = 1$ and $d = 2$ the extrapolation $m_{k+i} = m_k$ is clearly
a lower envelope for all feasible extrapolations. Proposition 5 shows
that the lower envelope for feasible extrapolations of order $d = 3$
is along a quadratic which tapers off to a constant function.

Proposition 5. Consider the constraints (8) with $d = 3$ and fixed
$l > 0$, and let $(m_1, \ldots, m_k)$ be a feasible solution to (8)
with $l = 0$. Let

$p = \begin{cases} \mathrm{gilb}(-\Delta^1 m_k / \Delta^2 m_k), & \text{if } \Delta^2 m_k < 0 \\ l, & \text{if } \Delta^2 m_k = 0. \end{cases}$ (18)

Then the extrapolation

$m_{k+i} = \begin{cases} m_k + i \Delta^1 m_k + \tfrac{1}{2} i (i+1) \Delta^2 m_k, & i = 1, \ldots, p \\ m_{k+p}, & i = p+1, \ldots, l \end{cases}$

is a lower envelope for all feasible extrapolations of order 3 to
$(m_1, \ldots, m_k)$.

Proof: Follows from Proposition 1. Q.E.D.

Proposition 6. Consider the constraints (8) with $d = 4$ and fixed
$l > 0$. A solution $(m_1, \ldots, m_k)$ which satisfies (8) with
$l = 0$ can be extrapolated to a solution $(m_1, \ldots, m_{k+l})$
which satisfies (8) with $l > 0$ if and only if

$\Delta^1 m_k + j \Delta^2 m_k + \tfrac{1}{2} j (j+1) \Delta^3 m_k \ge 0, \quad j = 1, \ldots, l.$ (19)

Let

$q = \begin{cases} \mathrm{gilb}(-\Delta^2 m_k / \Delta^3 m_k), & \text{if } \Delta^3 m_k > 0 \\ l, & \text{if } \Delta^3 m_k = 0 \end{cases}$

$\alpha = \Delta^1 m_k + q \Delta^2 m_k + \tfrac{1}{2} q (q+1) \Delta^3 m_k.$

Then the upper envelope of all such extrapolations is:

$m_{k+i} = \begin{cases} m_k + i \Delta^1 m_k + \tfrac{1}{2} i (i+1) \Delta^2 m_k + \tfrac{1}{6} i (i+1)(i+2) \Delta^3 m_k, & i = 1, \ldots, q \\ m_{k+q} + (i - q) \alpha, & i = q+1, \ldots, l. \end{cases}$

Proof: Follows from Proposition 2. Q.E.D.

Proposition 7. Consider the constraints (8) with $d = 4$ and fixed
$l > 0$, and let $(m_1, \ldots, m_k)$ be a solution to (8) with
$l = 0$ satisfying (19). Let $p$ be defined as in (18).

a. If $p \ge l$, then the extrapolation

$m_{k+i} = m_k + i \Delta^1 m_k + \tfrac{1}{2} i (i+1) \Delta^2 m_k, \quad i = 1, \ldots, l$

is a lower envelope for all feasible extrapolations of order 4 to
$(m_1, \ldots, m_k)$.

b. If $p < l$, let

$u = \min(l, 1 + \mathrm{gilb}(-2 \Delta^1 m_k / \Delta^2 m_k)).$

Then the extrapolation

$m_{k+i} = \begin{cases} m_k + i \Delta^1 m_k + \tfrac{1}{2} i (i+1) \Delta^2 m_k + \tfrac{1}{6} i (i+1)(i+2) \left( \dfrac{-2 (\Delta^1 m_k + u \Delta^2 m_k)}{u (u+1)} \right), & i = 1, \ldots, u \\ m_{k+u}, & i = u+1, \ldots, l \end{cases}$

is a lower envelope for all feasible extrapolations of order 4 to
$(m_1, \ldots, m_k)$.

Proof: Follows from Proposition 3. Q.E.D.

Proposition 7 states that the lowest envelope is either along a
quadratic function, or it starts as a cubic function which tapers off
to a constant function.


5. MONTE CARLO STUDY OF PERFORMANCE

To get an idea of how well the prediction envelopes estimate future
behavior, we conducted a small Monte Carlo simulation experiment. Our
goal is to estimate the number of events over some finite horizon. As
in [13], we compare the completely monotone approach to some of the
more popular parametric models. A value of $d = 4$ is used for the
completely monotone estimation ($\delta$ is taken as 1). Thus the
least squares problem (8) is solved for $d = 4$, with the constraints
of (19) replacing the constraints of (8) for
$i = k+1, \ldots, k+l$. Propositions 6 and 7 are applied to the
resulting solution to obtain the upper and lower envelopes for the
future mean function. Finally, we need a point estimate of the mean
number of failures. We arbitrarily decided to use the midpoint of the
envelope.
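
A sketch of how the $d = 4$ envelopes for the mean function and the midpoint estimate could be computed from a smoothed solution is given below. It follows the formulas of Propositions 6 and 7 with `math.floor` in place of gilb. The function names are hypothetical, `d1`, `d2`, `d3` denote $\Delta^1 m_k$, $\Delta^2 m_k$, $\Delta^3 m_k$, and taking the midpoint at the end of the horizon is an assumption about how the point estimate is formed.

```python
import math

def mean_upper_d4(m_k, d1, d2, d3, l):
    """Upper envelope for d = 4: cubic up to q, then linear with slope alpha."""
    q = l if d3 == 0 else min(l, math.floor(-d2 / d3))

    def cubic(i):
        return (m_k + i * d1 + 0.5 * i * (i + 1) * d2
                + i * (i + 1) * (i + 2) * d3 / 6.0)

    alpha = d1 + q * d2 + 0.5 * q * (q + 1) * d3
    return [cubic(i) if i <= q else cubic(q) + (i - q) * alpha
            for i in range(1, l + 1)]

def mean_lower_d4(m_k, d1, d2, l):
    """Lower envelope for d = 4: quadratic, or cubic tapering to a constant."""
    p = l if d2 == 0 else math.floor(-d1 / d2)

    def quad(i):
        return m_k + i * d1 + 0.5 * i * (i + 1) * d2

    if p >= l:
        return [quad(i) for i in range(1, l + 1)]
    u = min(l, 1 + math.floor(-2.0 * d1 / d2))
    third = -2.0 * (d1 + u * d2) / (u * (u + 1))   # constant third difference

    def cubic(i):
        return quad(i) + i * (i + 1) * (i + 2) * third / 6.0

    return [cubic(min(i, u)) for i in range(1, l + 1)]

def midpoint_future_failures(m_k, upper, lower):
    """Midpoint of the envelope at the horizon end, minus the current mean."""
    return 0.5 * (upper[-1] + lower[-1]) - m_k
```

For instance, with $m_k = 40$, $\Delta^1 m_k = 2$, $\Delta^2 m_k = -0.5$, $\Delta^3 m_k = 0.125$ and $l = 5$, the upper envelope ends at 46.75, the lower envelope ends near 43.67, and the midpoint estimate of future failures is about 5.2.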

Our choice of parametric models consists of three families of
nonhomogeneous Poisson processes (NHPP). The mean functions of the
NHPPs can have exponential, power or logarithmic form:

$M_{\exp}(t) = \gamma (1 - e^{-\eta t}),$

$M_{\mathrm{pow}}(t) = \gamma t^{\alpha},$

$M_{\log}(t) = \gamma \log(\beta t + 1).$

Those models are fit to data by using the method of maximum
likelihood [16]. Furthermore, we define a fourth model which is a
mixture of the above three. It is fit by selecting the best fitting
(maximum likelihood) of the three models. This is the “best”
parametric model, among the three possibilities.

We draw our data from 16 different Poisson processes. Each process is
observed over the interval [0,100] and the future interval is
[100,125], ie, 25% into the future. We used $k = 20$ and $l = 5$. The
16 cases provide a variety of growth patterns. Each case is
replicated 400 times. The cases are summarized in table 1.

TABLE 1. Data Models (Poisson Processes).
[All models are scaled so that E{N(100)} = M(100) = 40]

Model     Type of                            M(125) -
Number    NHPP           Parameter           M(100)

  1       Homogeneous                        10.00
  2       Power          α = .749             7.28
  3       Power          α = .557             5.29
  4       Power          α = .410             3.83
  5       Power          α = .296             2.73
  6       Power          α = .208             1.90
  7       Logarithmic    β = .0124            6.42
  8       Logarithmic    β = .0429            4.43
  9       Logarithmic    β = .131             3.16
 10       Logarithmic    β = .461             2.27
 11       Logarithmic    β = 2.43             1.62
 12       Exponential    η = .00808           5.88
 13       Exponential    η = .0167            3.17
 14       Exponential    η = .0265            1.47
 15       Exponential    η = .0385            0.54
 16       Exponential    η = .0550            0.12
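
The last column of table 1 can be reproduced from the three mean-function forms once each model is rescaled so that $M(100) = 40$. The sketch below does this; the helper names are hypothetical, and the parameter symbols ($\eta$, $\alpha$, $\beta$) follow the reading of the model formulas used in this section.

```python
import math

def scaled_future_mean(shape, t0=100.0, t1=125.0, target=40.0):
    """M(t1) - M(t0) for M(t) = gamma * shape(t), with gamma set so M(t0) = target."""
    return target * (shape(t1) - shape(t0)) / shape(t0)

def exp_shape(eta):
    return lambda t: 1.0 - math.exp(-eta * t)

def pow_shape(alpha):
    return lambda t: t ** alpha

def log_shape(beta):
    return lambda t: math.log(beta * t + 1.0)

future = {
    "homogeneous": scaled_future_mean(pow_shape(1.0)),     # constant rate
    "power a=.749": scaled_future_mean(pow_shape(0.749)),
    "log b=.0124": scaled_future_mean(log_shape(0.0124)),
    "exp eta=.0265": scaled_future_mean(exp_shape(0.0265)),
}
```

Under these assumptions the computed values agree with the tabulated entries for models 1, 2, 7 and 14 to within rounding of the printed parameters.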


The performance of the parametric models and the completely monotone
approach are summarized in tables 2-4. Table 2 shows the average
prediction made by each model for the 400 replicates of each case.
Table 3 shows the average percentage error, or bias. Table 4 shows
the root-mean-square percentage error for the 400 estimates made by
each model for the 16 test
cases. When the data come from a certain model, then that par- 
ticular model gives the best predictions. However in most cases, 


the completely monotone comes in as “second best”, ie, it gives 
better predictions than those given by using the incorrect 
parametric model. In practice, of course, it is highly unlikely 
that a parametric model used for prediction will indeed be the 
“correct” model from which the failure data were generated. 
Table 5 summarizes the performance of the prediction 
envelopes. The majority of the envelopes have zero width, ie, 
the upper envelope is identical to the lower envelope. 
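
The two summary statistics reported in tables 3 and 4 can be computed from the replicated predictions as sketched below. The exact definitions used in the study are not spelled out in the text, so the formulas here (error of the average prediction relative to the true mean, and root mean square error relative to the true mean) are an assumption, and the function names are hypothetical.

```python
import math

def percent_bias(predictions, true_mean):
    """Average prediction error as a percentage of the true mean."""
    avg = sum(predictions) / len(predictions)
    return 100.0 * (avg - true_mean) / true_mean

def percent_rmse(predictions, true_mean):
    """Root mean square prediction error as a percentage of the true mean."""
    mse = sum((p - true_mean) ** 2 for p in predictions) / len(predictions)
    return 100.0 * math.sqrt(mse) / true_mean
```

As a consistency check on this reading, an average prediction of 9.52 against a true mean of 10.00 gives a bias of about -5 percent, which matches the completely monotone entry for model 1 in tables 2 and 3.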


TABLE 2. Average Predictions of Mean Number
over Future Horizon

Model     True                                          C.M.
Number    Mean     EXP     LOG     POW     BEST         Mdpt.

  1       10.00    8.67    8.86    9.33    8.73         9.52
  2        7.28    5.62    6.11    7.34    6.38         7.72
  3        5.29    2.97    3.73    5.36    4.70         5.88
  4        3.83    1.36    2.23    3.88    3.62         4.50
  5        2.73    0.51    1.31    2.76    2.65         3.40
  6        1.90    0.15    0.75    1.92    1.87         2.52
  7        6.42    6.10    6.76    8.20    6.48         7.44
  8        4.43    3.41    4.68    6.63    4.19         5.38
  9        3.16    1.64    3.29    5.27    2.89         4.06
 10        2.27    0.64    2.35    4.11    2.28         3.08
 11        1.62    0.18    1.66    3.10    1.71         2.30
 12        5.88    5.86    6.57    8.09    6.26         7.31
 13        3.17    3.21    4.70    6.72    3.67         4.73
 14        1.47    1.51    3.62    5.63    1.91         2.84
 15        0.54    0.57    2.95    4.77    0.75         1.60
 16        0.12    0.14    2.48    4.07    0.21         0.88


TABLE 3. Percent Prediction Error (Bias) for Mean Future Number

Model                     Fitted Model
Number    EXP      LOG       POW       BEST     CM

  1       -13.     -11.      -7.       -13.     -5.
  2       -23.     -16.      +1.       -12.     +6.
  3       -44.     -30.      +1.       -11.     +11.
  4       -65.     -42.      +1.       -6.      +17.
  5       -81.     -52.      +1.       -3.      +24.
  6       -92.     -61.      +1.       -2.      +33.
  7       -5.      +5.       +28.      +1.      +16.
  8       -23.     +5.       +50.      -6.      +21.
  9       -48.     +4.       +67.      -8.      +29.
 10       -72.     +3.       +81.      0.       +36.
 11       -89.     +3.       +98.      +6.      +42.
 12       0.       +12.      +32.      +6.      +24.
 13       +1.      +48.      +112.     +16.     +49.
 14       +3.      +146.     +282.     +30.     +93.
 15       +6.      +448.     +788.     +40.     +199.
 16       +14.     +1921.    +3216.    +74.     +615.


Perspective 

We stress that some components in the formulation of the 
completely monotone model were chosen arbitrarily. Other 
definitions of the raw estimates and other objective functions 
acknowledges support of the US National Aeronautics and Space Administration Grant NAG-1-771,

TABLE 4

Percent Root Mean Square Error for Mean Future Number Prediction


Model                     Fitted Model
Number      EXP      LOG      POW     BEST       CM

   1         26       24       29       24       29
   2         39       32       23       31       30
   3         53       39       23       32       39
   4         69       46       23       29       48
   5   [illegible]    55       23       26       60
   6         93       62       23       26       73
   7         37       32       39       36       38
   8         44       31       59       43       50
   9         57       26       75       47       62
  10         75       23       89       40       74
  11         90       21       99       27       85
  12         38       34       47       37       45
  13         45       61      120       53       80
  14         54      155      292       82      142
  15         67      461      805      134      278
  16         95     1956     3268      336      776
will give different, and possibly better, estimates. Nevertheless,
the completely monotone approach shows a robustness not exhibited
by the individual parametric models. The procedure has
quite low bias, which is less than that caused by using the incorrect
parametric models for prediction. Comparisons to the
“best” parametric model are unfair because the Monte Carlo
data are, in effect, drawn from that model. We could use other
models to generate data for which this “best” parametric model
is inferior to the more robust completely monotone approach.
ABSTRACT

We address the problem of predicting future failures for a piece of software.
The number of failures occurring during a finite future time interval is
predicted from the number of failures observed during an initial period of
usage by using software reliability growth models. Two different methods for
using the models are considered: straightforward use of individual models
(simple models), and dynamic selection among models based on goodness-of-fit
and quality-of-prediction criteria (super models). Performance is judged
by the relative error of the predicted number of failures over future finite time
intervals relative to the number of failures eventually observed during the
intervals. Six simple models and eight super models are evaluated based on
their performance on twenty data sets. This study is by no means
comprehensive. Some conclusions can be drawn, but many open questions
remain regarding the use and the performance of software reliability growth
models.


INTRODUCTION 

Software sometimes fails to perform as desired. These failures may be due to
errors, ambiguities, oversights or misinterpretations of the specification
which the software is supposed to satisfy, carelessness or incompetence in
writing code, inadequate testing, incorrect or unexpected usage of the 
software, or other unforeseen problems. All of these potential sources of 
failure create an environment of uncertainty for the behavior of any 
software: will the software fail or not? If so, when? Statistical modeling and 
analysis provide tools to investigate this phenomenon. 

A general goal is to understand, predict and control the uncertainty in 
software failure behavior. Statistical models and analysis can investigate 
various aspects of software and its failure, at different levels of detail. Our 
study treats a piece of software as a 'black box’ operating in a random 
environment. We ignore factors in the development of the software, the 
internal structure and functioning of the software, and details of the 
operating environment. In contrast, Eckhardt & Lee, 1 Littlewood 2 and 
Littlewood & Miller 3 present models that deal more closely with the 
structure of the software. 

In this paper we consider the sequence of times at which a piece of 
software fails. After each failure, the software is fixed so that (hopefully) it 
will not fail again from the same cause. From these data we want to predict 
future failure behavior. In particular, we will try to predict the number of 
additional failures which will occur during a future time interval of finite 
length. Our approach is to use ‘reliability growth models'. The questions are: 
'What is the best way to do this?’ and ‘How well do these models predict 
future failure behavior?'. Many reliability growth models have been 
proposed. For a given piece of software it is very difficult (perhaps 
impossible) to know which reliability growth model to use. (Iannino et al. 3 * 
give qualitative guidelines for choosing different software reliability growth 
models.) It is also difficult to know much about the accuracy of the 
predictions about future failures. Our study looks at these problems. 

We have taken failure data for 20 programs, fitted reliability growth
models to initial segments of each data set, predicted the number of
remaining failures in the data set, and computed the prediction errors. Our
reliability growth models include several of the usual models in the literature 
and additional models that we call 'super models'. These super models are 
based on a set of the usual reliability growth models plus a selection criterion 
which identifies one of the set to use for predictions at each point of time; 
selection criteria may be based on 'goodness-of-fit' or 'quality-of-past- 
prediction' measures. We have tried to identify the best models or 
approaches, conditional on our 20 failure data sets. We cannot make any 
strong recommendations, but we do see that many of the models give useful 
predictions if only nominal levels of reliability are of concern. The major 
conclusion is that there are still important open questions in the area of 
reliability growth modeling and prediction. We hope that this paper will 
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Software reliability growth models 


serve as an example of an objective study of this important problem, and 
that more work will be done. 

An important negative conclusion can be drawn from any study of 
reliability growth modeling: only moderate levels of reliability can be 
treated. Extremely high levels of reliability such as those required in safety 
critical systems cannot be treated; see Miller. 4,5 Some software reliability
growth models will occasionally predict that no failures will occur in the 
future; however, this cannot be done with levels of confidence required in 
safety critical software. Even the most casual examination of the numbers 
presented in this paper should lead the reader to that conclusion. 
THE RELIABILITY GROWTH SCENARIO

A system contains design flaws, each of which eventually manifests itself at
some point in time, whereupon the system is redesigned in order to remove
the design flaw. Design flaws are often called ‘bugs’, and the time points of
manifestation mentioned above will be called ‘failure times’. If the failure
times are indexed chronologically they can be represented as

    $0 < t_1 < t_2 < t_3 < t_4 < \cdots < t_c$    (1)

where $t_c$ is the ‘current’ time, the length of time that the system has been
investigated, i.e. execution time for software. A convenient way to
graphically present these failure time data and stochastic processes is with a
plot of cumulative number of failures versus cumulative time: let

    $n(t) = \max\{i : t_i \le t\}, \quad 0 \le t \le t_c$    (2)

be the sample path of such data as depicted in Fig. 1. If system redesigns
successfully remove design flaws, system reliability will improve and eqn (1)
should show a general pattern of stochastically increasing interfailure times,
and plots of the cumulative number of failures in eqn (2) should show a
positive but stochastically decreasing slope (negative second derivative).
This phenomenon is called ‘reliability growth’, i.e. the reliability of the
system is improving as successive redesigns remove design flaws. We wish to
make predictions about future behavior of the software, i.e. for $t$ such that
$t_c < t$.

Fig. 1. Observed cumulative number of failures as a function of time.

RELIABILITY GROWTH MODELS 

It is convenient to consider eqn (1) as the realization of a random process:

    $0 < T_1 < T_2 < T_3 < T_4 < \cdots$    (3)

where the $T$'s are random variables (the $t$'s are real scalars) and the process is
observed for $t$, $0 \le t \le t_c$. The stochastic process of which eqn (2) is a
realization is

    $\{N(t) = \max\{i : T_i \le t\},\ 0 \le t\}$    (4)

The stochastic processes, eqns (3) and (4), are ‘reliability growth processes’.
Numerous reliability growth models have been proposed for the analysis of
software reliability. The first one specifically for software was proposed by
Jelinski & Moranda. 6 There are numerous surveys 7-11 of the software
reliability growth modeling literature.

There tend to be three general classes of software reliability growth
models: interfailure time models, order statistic models, and Poisson process
models. Examples include the Littlewood-Verrall model, 12 Littlewood’s
Pareto model 13 and Duane’s power law nonhomogeneous Poisson
process, 14,15 respectively. (For general discussions of the interrelationships
between these classes of models, some additional modeling considerations,





derivations of these models from more basic principles, underlying
assumptions and complications, see Gray 16,17 and Miller. 18 ) All reliability
growth processes can be thought of as consisting of noisy behavior around a
smooth trend curve. One obvious way of describing the trend curve is with
the average number of failures occurring by time $t$, i.e. the expected value of
the number of failures; thus a trend curve for stochastic processes in eqns (3)
and (4) is

    $M(t) = E[N(t)], \quad 0 \le t$    (5)

Several members of a logarithmic family of trend (or growth) curves are
shown in Fig. 2. A rough approach to the prediction problem is to pick a
member from a parametric family of growth curves (as in Fig. 2) which best
fits some software failure data (as in Fig. 1) and then extrapolate along the
curve to the right in order to predict the expected number of failures during a
future time interval. In fact, most reliability growth modeling is equivalent to
this kind of curve fitting. Sophisticated statistical techniques may be used to
fit the models. But it has not been proven that these statistical techniques are
superior to a simple qualitative ‘eye-ball’ fit. This fact should be kept in mind
when interpreting the accuracy of the predictions from reliability growth
models. In particular, these models are not refined enough to distinguish
between whether there are zero bugs or one bug remaining in a piece of
software.

Fig. 2. Subset of mean functions for a parametric family of reliability growth models with logarithmic trend.

Another conclusion from the point of view that software reliability 
growth is noisy behavior around a growth curve can be used in defining a 
rich family of reliability growth models: we want a family which is 
characterized by the mean function, eqn (5), and we want a rich set of mean 
functions. (For a discussion of necessary and sufficient conditions for 
reliability growth mean functions, see Miller. 18 For a nonparametric 
approach, see Miller & Sofer. 19 ) An attractive family of stochastic processes, 
eqn (4), characterized by their mean functions are the nonhomogeneous 
Poisson processes (NHPP); Musa & Okumoto 20,21 have promoted this idea.
NHPPs have an independent Poisson number of failures in disjoint
intervals:

    $P\{N(t+s) - N(t) = n\} = e^{-[M(t+s) - M(t)]} \, \frac{(M(t+s) - M(t))^n}{n!}, \quad 0 \le t,\ 0 \le s,\ n = 0, 1, 2, 3, \ldots$    (6)


This characterizes the processes. 
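
As an illustration of eqn (6), the following minimal sketch evaluates the Poisson probability of observing n failures in an interval (t, t+s] for a given mean function. The function names and the parameter values (gamma = 20, beta = 0.01) are hypothetical and chosen only for the example; the logarithmic mean function is used simply because it is one of the trends discussed in the text.

```python
import math


def log_mean(t, gamma, beta):
    """Logarithmic mean function M(t) = gamma * log(1 + beta * t)."""
    return gamma * math.log(1.0 + beta * t)


def nhpp_increment_pmf(n, t, s, mean_fn):
    """P{N(t+s) - N(t) = n} for an NHPP with mean function mean_fn, as in eqn (6)."""
    mu = mean_fn(t + s) - mean_fn(t)  # expected number of failures in (t, t+s]
    return math.exp(-mu) * mu ** n / math.factorial(n)


# Hypothetical parameter values, for illustration only.
def example_mean(t):
    return log_mean(t, gamma=20.0, beta=0.01)


# Probabilities of 0, 1, ..., 59 failures in the interval (100, 150].
probs = [nhpp_increment_pmf(n, 100.0, 50.0, example_mean) for n in range(60)]
prob_no_failure = probs[0]
```

Because the increment is Poisson, the probabilities over n sum to one, and the probability of no failure is exp(-[M(t+s) - M(t)]).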

We shall use six parametric families of NHPPs, characterized by their 
mean functions: 


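M1  Power:          $M_1(t) = \gamma t^{\alpha}$                         $0 < \alpha < 1$
M2  Exponential:    $M_2(t) = \gamma (1 - e^{-\eta t})$                  $0 < \eta$
M3  Logarithmic:    $M_3(t) = \gamma \log(1 + \beta t)$                  $0 < \beta$
M4  Pareto:         $M_4(t) = \gamma (1 - (1 + \beta t)^{-\alpha})$      $0 < \alpha,\ 0 < \beta$
M5  General Power:  $M_5(t) = \gamma ((1 + \beta t)^{-\alpha} - 1)$      $-1 < \alpha < 0,\ 0 < \beta$
M6  Weibull:        $M_6(t) = \gamma (1 - \exp(-\eta t^{\alpha}))$       $0 < \alpha,\ 0 < \eta,\ 0 < \gamma$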


The ‘Power’ law was first suggested in a reliability growth context by 
Duane 14 and specifically as a NHPP model by Crow. 15 The ‘Exponential’ 
law is the trend encountered in the Jelinski-Moranda 6 model and the trend 
of the Goel-Okumoto 22 NHPP software reliability growth model. The 
‘Logarithmic’ trend is used by Musa and Okumoto. (Figure 2 shows 
some of the mean functions for this family.) The ‘Pareto’ curve occurs in 
Littlewood’s 13 order statistic model. The ‘General Power’ curve arises
naturally when considering order statistics of independent but
non-identically exponentially distributed failure times. 18 The ‘Weibull’ NHPP is
discussed by Musa & Okumoto, 20 Abdalla-Ghaly et al., 7 Miller 18 and
others. Taken together, these parametric families include many of the
reliability growth models proposed in the literature; see Miller 18 for plots of 
selected mean functions of these models. 
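
The six mean functions can be written down directly. The sketch below is one possible transcription, assuming the parameterization listed for M1-M6; in particular the General Power form gamma * ((1 + beta*t)^(-alpha) - 1) with -1 < alpha < 0 is a reading of a poorly legible formula and should be treated as an assumption. Function names and the sample parameter values are hypothetical.

```python
import math


def m1_power(t, gamma, alpha):
    """M1 Power: gamma * t**alpha, with 0 < alpha < 1."""
    return gamma * t ** alpha


def m2_exponential(t, gamma, eta):
    """M2 Exponential: gamma * (1 - exp(-eta * t)), with 0 < eta."""
    return gamma * (1.0 - math.exp(-eta * t))


def m3_logarithmic(t, gamma, beta):
    """M3 Logarithmic: gamma * log(1 + beta * t), with 0 < beta."""
    return gamma * math.log(1.0 + beta * t)


def m4_pareto(t, gamma, alpha, beta):
    """M4 Pareto: gamma * (1 - (1 + beta*t)**(-alpha)), with 0 < alpha, 0 < beta."""
    return gamma * (1.0 - (1.0 + beta * t) ** (-alpha))


def m5_general_power(t, gamma, alpha, beta):
    """M5 General Power (assumed form): gamma * ((1 + beta*t)**(-alpha) - 1), -1 < alpha < 0."""
    return gamma * ((1.0 + beta * t) ** (-alpha) - 1.0)


def m6_weibull(t, gamma, alpha, eta):
    """M6 Weibull: gamma * (1 - exp(-eta * t**alpha)), with 0 < alpha, 0 < eta."""
    return gamma * (1.0 - math.exp(-eta * t ** alpha))


# Hypothetical parameter values, used only to evaluate the curves.
curves = {
    "M1": lambda t: m1_power(t, 5.0, 0.5),
    "M2": lambda t: m2_exponential(t, 50.0, 0.01),
    "M3": lambda t: m3_logarithmic(t, 20.0, 0.05),
    "M4": lambda t: m4_pareto(t, 50.0, 1.5, 0.01),
    "M5": lambda t: m5_general_power(t, 10.0, -0.5, 0.05),
    "M6": lambda t: m6_weibull(t, 50.0, 0.8, 0.02),
}
```

With these values every curve starts at zero, increases, and has a decreasing slope, which is the reliability growth pattern described in the text.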

FITTING MODELS 


We fit the six NHPP models (M1, M2, M3, M4, M5 and M6) to data in the
form of eqns (1) or (2) as depicted in Fig. 1. In effect, for each of the above six
parametric families, we want to find the ‘best’ fitting curve. We use the
method of maximum likelihood as suggested and described by Musa &
Okumoto 20 for fitting these models to data consisting of single sample
paths. For each parametric family we get maximum likelihood estimates
(MLEs) of the appropriate parameters: $\alpha$, $\beta$, $\gamma$, etc.; this gives us a curve

    $\hat{M}(t), \quad 0 \le t \le t_c$    (7)

uniquely determining the MLE NHPP. To solve for the MLEs we used the
Nelder-Mead 24 simplex search algorithm. We wanted a general algorithm
to solve general MLE problems for this study; in practice one would want to
devote more effort to finding the MLEs as Chan 25 does for some models.
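
To make the fitting step concrete, the sketch below writes out the standard log-likelihood of an NHPP observed on [0, t_c], namely the sum of log m(t_i) minus M(t_c), where m is the derivative of M. It then fits the Power model (M1), for which the maximizer has a well-known closed form; this closed form is used here in place of the general simplex search described in the text, purely to keep the example short and dependency-free. The failure times and all function names are hypothetical.

```python
import math


def nhpp_log_likelihood(times, t_c, mean_fn, rate_fn):
    """Log-likelihood of failure times on [0, t_c]: sum(log m(t_i)) - M(t_c)."""
    return sum(math.log(rate_fn(t)) for t in times) - mean_fn(t_c)


def fit_power_law(times, t_c):
    """Closed-form MLE for M(t) = gamma * t**alpha observed up to time t_c."""
    n = len(times)
    alpha = n / sum(math.log(t_c / t) for t in times)
    gamma = n / t_c ** alpha
    return gamma, alpha


# Hypothetical failure times (execution time units) and current time.
times = [3.0, 11.0, 25.0, 48.0, 90.0, 160.0, 270.0, 430.0]
t_c = 500.0

gamma_hat, alpha_hat = fit_power_law(times, t_c)


def loglik(gamma, alpha):
    """Log-likelihood of the power model at (gamma, alpha) for the data above."""
    return nhpp_log_likelihood(
        times,
        t_c,
        lambda t: gamma * t ** alpha,
        lambda t: gamma * alpha * t ** (alpha - 1.0),
    )
```

For families without a closed form, the same `nhpp_log_likelihood` could be handed to a general-purpose optimizer such as a simplex search. Note that the fitted curve reproduces the observed count at t_c, i.e. gamma_hat * t_c**alpha_hat equals the number of failures.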

There is no unique definition of ‘best-fitting’. The best way to fit a 
stochastic process model to an observed realization of the process is an open 
question. As mentioned before, an ‘eye-ball’ fitting may work well. Least- 
square or Kolmogorov-Smirnov distances could be used. The definition of 
‘best-fitting’ is certainly dependent on how the fitted curve is to be used. In 
our context we could define 'best-fitting' as equivalent to ‘best-predicting’; 
Brocklehurst 26 has investigated this approach of fitting some reliability 
growth models by optimizing certain quality-of-prediction measures. 

We are faced with two problems: finding the best-fitting member of a
given parametric family and choosing among the best-fitting from several
parametric families. We have rather arbitrarily decided to use the MLE for a
given family. To choose among different families we shall try several
approaches: minimum Kolmogorov-Smirnov distance, maximum likelihood
and three others (Retro-U, Retro-Y and Retro-PL) to be described
later.


PREDICTIONS 




Various predictions can be made from the fitted NHPP with mean function
(eqn (7)):

the expected number of failures during a future time interval $(t, t+s]$:

    $\hat{M}(t+s) - \hat{M}(t)$

the current failure rate, at time $t_c$:

    $\hat{m}(t_c) = \frac{d}{dt} \hat{M}(t) \Big|_{t = t_c}$

the time until a target failure rate $r_0$:

    $t_0 = \min\{t : \hat{m}(t) = r_0\}$

the distribution of the time until next failure from the current time $t_c$:

    $\hat{F}_{t_c}(s) = 1 - P(0 \text{ failures in } (t_c, t_c+s]) = 1 - \exp(-[\hat{M}(t_c+s) - \hat{M}(t_c)])$

and the density of time until next failure from current time.

A standard approach is to consider the modeling, fitting and prediction
steps as separate activities. Since the ultimate goal is good prediction,
Abdalla-Ghaly et al. 7 argue convincingly that an integrated approach
should be taken: they introduce the idea of a ‘prediction system’ which
integrates the above three phases. We are taking such an integrated point of
view in this paper.
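
The predictions listed above all follow from a fitted mean function. As a hedged sketch, the code below works them out for the Logarithmic trend M(t) = gamma * log(1 + beta * t), whose failure rate is m(t) = gamma * beta / (1 + beta * t). The parameter values stand in for fitted estimates and are hypothetical, as are the function names.

```python
import math


def mean_log(t, gamma, beta):
    """Mean function M(t) = gamma * log(1 + beta * t)."""
    return gamma * math.log(1.0 + beta * t)


def rate_log(t, gamma, beta):
    """Failure rate m(t) = dM/dt = gamma * beta / (1 + beta * t)."""
    return gamma * beta / (1.0 + beta * t)


def expected_failures(t, s, gamma, beta):
    """Expected number of failures in (t, t+s]: M(t+s) - M(t)."""
    return mean_log(t + s, gamma, beta) - mean_log(t, gamma, beta)


def time_of_target_rate(r0, gamma, beta):
    """Smallest t with m(t) = r0; m is decreasing, so the solution is unique."""
    if r0 > gamma * beta:
        raise ValueError("target rate exceeds the initial failure rate m(0)")
    return (gamma * beta / r0 - 1.0) / beta


def next_failure_cdf(s, t_c, gamma, beta):
    """P(next failure occurs within s of t_c) = 1 - exp(-[M(t_c+s) - M(t_c)])."""
    return 1.0 - math.exp(-expected_failures(t_c, s, gamma, beta))


def next_failure_pdf(s, t_c, gamma, beta):
    """Density of the time to next failure: m(t_c+s) * exp(-[M(t_c+s) - M(t_c)])."""
    return rate_log(t_c + s, gamma, beta) * math.exp(
        -expected_failures(t_c, s, gamma, beta)
    )


# Hypothetical fitted values and current time.
GAMMA, BETA, T_C = 30.0, 0.02, 200.0
t0 = time_of_target_rate(0.05, GAMMA, BETA)
```

The density is simply the derivative of the distribution function with respect to s, which is how the two predictive quantities for the next interfailure time are related.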


QUALITY OF PREDICTIONS 

We wish to evaluate the accuracy of the predictions of future failure 
behavior which we make. If we predict an observable quantity, we can wait 
and compare the observation with the prediction, and then compute a 
measure of discrepancy. When predicting the number of failures in finite 
future time intervals, the error is simply the difference between the predicted 
number and the observed number. For a given piece of software undergoing 
execution, failure and fix, as time passes we can make predictions up to 
various time horizons, then when that horizon is reached we can compare 
the prediction with observation. So for a given piece of software we can 
make a sequence of predictions which can be checked against observation; 
from this a measure of quality-of-prediction can be computed. 

When we use a reliability growth model to predict the distribution or the
density of the time until the next failure, we must compare predicted
distributions to observed times in order to get a measure of quality-of-prediction.
The procedure is as follows: after each failure is observed, the
model is fitted to the data observed thus far, giving a new estimated mean
function. The mean function fitted to the first $i$ observed failures is denoted
as $\hat{M}_i(t)$, $0 \le t$. The estimates of the distribution and the density of the time
until the next failure are then computed:

    $\hat{F}_{i+1}(s) = 1 - \exp(-[\hat{M}_i(t_i + s) - \hat{M}_i(t_i)]), \qquad \hat{f}_{i+1}(s) = \hat{F}'_{i+1}(s)$

Then the next failure is observed, at time $t_{i+1}$, so the interfailure time is

    $x_{i+1} = t_{i+1} - t_i$

Thus we have a sequence of predictions of successive interfailure time
distributions and a sequence of observed interfailure times. The goal is to
evaluate how well the predictive distributions actually predicted the
observed interfailure times. We would like a quantitative measure of
quality-of-prediction. Littlewood and co-workers 7,25,27 have provided three such
measures, which we now summarize. (These measures are used mainly for
comparison purposes. It is difficult to interpret the deviations
quantitatively.)

The first quality-of-prediction measure is the ‘u-plot’: it is well known that
$U = F_X(X)$ has a uniform distribution on the interval [0, 1]. Using this fact, if
$\hat{F}_{i+1}(\cdot)$ is the true distribution of $X_{i+1}$, then $u_{i+1} = \hat{F}_{i+1}(x_{i+1})$ will be an
observation from a $U[0, 1]$ distribution. Thus the empirical distribution
formed from the $u$'s should be close to that of $U[0, 1]$. If we observe $n_0 + n$
failures, starting to make predictions after the $n_0$th failure, the plot of

    $\{(u_{(i)},\ i/(n+1)),\ i = 1, 2, 3, \ldots, n\}$

where $u_{(1)} \le u_{(2)} \le \cdots \le u_{(n)}$ are the ordered values of $u_{n_0+1}, \ldots, u_{n_0+n}$,
is the u-plot. The maximum deviation of the u-plot from the identity
function is a measure of quality-of-prediction.
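
A minimal sketch of the u-plot distance, assuming the plot consists of the points (u_(i), i/(n+1)) formed from the ordered u values, and that the deviation is measured at those points. The function name and the sample values are hypothetical.

```python
def u_plot_distance(us):
    """Maximum deviation of the points (u_(i), i/(n+1)) from the identity line."""
    n = len(us)
    ordered = sorted(us)  # order statistics u_(1) <= ... <= u_(n)
    return max(abs(u - i / (n + 1)) for i, u in enumerate(ordered, start=1))


# Hypothetical u values, i.e. predictive CDFs evaluated at observed interfailure times.
perfectly_uniform = [0.25, 0.75, 0.5]   # equals i/(n+1) after sorting
too_small = [0.1, 0.2, 0.3]             # predictions systematically too optimistic or pessimistic

d_good = u_plot_distance(perfectly_uniform)
d_bad = u_plot_distance(too_small)
```

A set of u values spread evenly over [0, 1] gives a distance near zero, while values bunched toward one end give a large distance.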

The second measure of quality-of-prediction is the ‘y-plot’: if the
predictive distributions are good the $u$'s should look like a random sequence
of independent $U[0, 1]$ variables, and the $-\log(1 - u_i)$'s like exponential
random variates. In this case, let

    $y_i = \sum_{j=1}^{i} \log(1 - u_{n_0+j}) \Big/ \sum_{j=1}^{n} \log(1 - u_{n_0+j}), \quad i = 1, 2, 3, \ldots, n$

and plot the pairs $\{(y_i,\ i/(n+1)),\ i = 1, 2, 3, \ldots, n\}$. If the predictive
distributions are good, this plot should be close to the identity function. A
quantitative measure of the quality-of-prediction is the maximum deviation
between the y-plot and the identity function.
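
The y-plot can be sketched in the same way. Unlike the u-plot, the u values are kept in their time order, since the plot is meant to reveal trend in the sequence. The code assumes every u is strictly less than 1; names and sample values are hypothetical.

```python
import math


def y_plot_distance(us):
    """Maximum deviation of the points (y_i, i/(n+1)) from the identity line.

    y_i is the running sum of -log(1 - u_j) divided by the total sum,
    with the u values taken in time order.
    """
    xs = [-math.log(1.0 - u) for u in us]  # exponential variates if the u's are uniform
    total = sum(xs)
    n = len(xs)
    running = 0.0
    worst = 0.0
    for i, x in enumerate(xs, start=1):
        running += x
        worst = max(worst, abs(running / total - i / (n + 1)))
    return worst


# Hypothetical sequences of u values in time order.
steady = [0.5, 0.5, 0.5, 0.5]
trending = [0.05, 0.1, 0.9, 0.95]

d_steady = y_plot_distance(steady)
d_trending = y_plot_distance(trending)
```

A sequence with no trend keeps the distance small (here 1/(n+1), the smallest gap possible at the last point), whereas a sequence whose u values drift upward over time yields a larger distance.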

The third measure of quality-of-prediction is the prequential likelihood:
based on Dawid's 28 generalization of likelihood to a sequential situation, we
have

For comparison purposes, the best predictive system should have the largest
prequential likelihood.
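
The defining formula is not legible in this copy. The sketch below therefore assumes the usual product form, in which the prequential likelihood is the product of the one-step-ahead predictive densities evaluated at the interfailure times actually observed; the logarithm is accumulated to avoid underflow. The two competing "prediction systems" are hypothetical constant-rate exponential densities, used only to show how the comparison works.

```python
import math


def log_prequential_likelihood(predictive_densities, interfailure_times):
    """Sum of log f_i(x_i): each density was issued before x_i was observed.

    Assumed product form of the prequential likelihood, on the log scale.
    """
    return sum(
        math.log(density(x))
        for density, x in zip(predictive_densities, interfailure_times)
    )


def exponential_density(rate):
    """Density of an exponential interfailure time with the given rate."""
    return lambda x: rate * math.exp(-rate * x)


# Hypothetical observed interfailure times.
observed = [8.0, 12.0, 9.0, 15.0]

# Two hypothetical prediction systems, each issuing one density per interfailure time.
system_a = [exponential_density(0.1) for _ in observed]
system_b = [exponential_density(1.0) for _ in observed]

log_pl_a = log_prequential_likelihood(system_a, observed)
log_pl_b = log_prequential_likelihood(system_b, observed)
preferred = "A" if log_pl_a > log_pl_b else "B"
```

System A, whose predicted mean interfailure time of 10 is close to the data, attains the larger prequential likelihood and would be preferred.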

For a detailed discussion of these three measures of quality-of-prediction,
see Abdalla-Ghaly et al., 7 Chan 25 and Keiller et al. These three measures
give a dynamic real-time evaluation of how well a given parametric model
has done predicting interfailure times up to the present. It would seem
logical to calculate the next prediction from the parametric family which has
performed best up to the present on the particular software failure data set
under consideration. These three measures give a basis for making this
choice.

There are other possible measures of quality-of-prediction. For example, 
one such measure could be based on past predictions of the number of 
failures to be observed in finite time intervals which have subsequently 
elapsed. 


SUPER MODELS 

We consider eight super models. A super model is a set of parametric 
reliability growth models and a selection criterion; for a given software 
failure data set and for a given time, the selection criterion chooses the 
parametric model in the set that is to be used for making predictions of 
future failure behavior. As time passes for a given data set, a given super 
model may change its choice of parametric family to use for predictions. 

Our procedure for a super model is as follows: using maximum likelihood
estimation, we fit all six of the parametric models (M1-M6). Next, the
selection criterion picks one parametric class based on the fitted models. The
current fitted model of the chosen class is used for making predictions at the
current time.

We consider two goodness-of-fit criteria: Kolmogorov-Smirnov deviations
between the fitted mean function and the sample path, i.e.

    $\sup_{0 \le t \le t_c} |\hat{M}(t) - n(t)|$

and the maximum likelihood of the fitted models. We comment that three-parameter
models (M4, M5 and M6) should fit the data better than two-parameter
models (M1, M2 and M3), and some correction should be made to
the goodness-of-fit criterion to reflect this; see Akaike. 29 We do not pursue
this here. It is another example of one of the open questions in this research
area.
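
The Kolmogorov-Smirnov criterion can be sketched as follows. Because the sample path n(t) is a step function and a fitted mean function is nondecreasing, the supremum over [0, t_c] is attained just before or at a failure time, or at the endpoints, so only those points need checking. The failure times and the two candidate curves are hypothetical stand-ins for fitted models; the selection step mimics a super model that picks the candidate with the smallest distance.

```python
def ks_distance(times, t_c, mean_fn):
    """sup over [0, t_c] of |mean_fn(t) - n(t)|, n(t) being the cumulative failure count.

    Assumes mean_fn is nondecreasing and times are sorted with all values <= t_c.
    """
    worst = abs(mean_fn(0.0))
    for i, t in enumerate(times, start=1):
        m = mean_fn(t)
        # n(t) jumps from i-1 to i at the i-th failure time.
        worst = max(worst, abs(m - (i - 1)), abs(m - i))
    return max(worst, abs(mean_fn(t_c) - len(times)))


# Hypothetical failure times and two hypothetical fitted curves.
times = [1.0, 2.0, 3.0, 4.0]
t_c = 5.0
candidates = {
    "curve_a": lambda t: t,
    "curve_b": lambda t: 0.8 * t,
}

distances = {name: ks_distance(times, t_c, fn) for name, fn in candidates.items()}
selected = min(distances, key=distances.get)
```

As the text notes, a criterion of this kind does not penalize extra parameters, so a three-parameter family would tend to be selected more often than a raw comparison of predictive quality would justify.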

We consider three pure quality-of-prediction measures: the u-plot, the y-plot
and the prequential likelihood. Using these criteria requires fitting each
of the six NHPP models (M1-M6) after each failure and calculating the
predictive distribution and density of the time until next failure. So these
three super models require more computationally intensive implementation.

Finally, we consider three hybrid super models. We use the three
quality-of-prediction measures in a goodness-of-fit mode. At current time, $t_c$, the six
NHPP models (M1-M6) are fitted to the data; each fitted model is then used
retroactively to predict (‘retrodict’) the already elapsed interfailure time
distributions and densities; from these retrodictions and the observed data,
u-plots, y-plots and prequential likelihood can be computed, and the best
fitted model chosen from among the six simple models (M1-M6).

We comment that it is possible to define other selection criteria. In a 
related piece of work, Littlewood & Keiller 30 and Chan 30a have shown how 
to improve the prediction of the time until next failure by adapting a 
reliability growth model, basing the adaption on past quality-of-prediction. 

To summarize, our eight super models are all based on six NHPP models 
(Ml, M2, M3, M4, M5 and M6). The selection criteria for the eight super 
models are: 

M7   Maximum likelihood
M8   Minimum K-S distance
M9   u-plot
M10  y-plot
M11  Prequential likelihood
M12  Retro. u-plot
M13  Retro. y-plot
M14  Retro. PL


EXPERIMENTS 


We investigate the performance of 14 reliability growth models: six simple 
models (M1-M6) and eight super models (M7-M14). We see how well the 
different models can predict the number of new failures manifested during 
finite future time intervals. 

We base our experiment on 20 sets of software failure data, denoted
D1-D20 in Table 1. The first 15 data sets are the same as those used in
performance experiments by Musa and co-workers, 20,21,31 with two
modifications: D5 consists of only the first 288 interfailure times of Musa’s 32
System 5 because a major code change occurred at that point; D11 consists
of the last 100 interfailure times of Sukert’s 33 data set because the data set
was huge with several major changes. The data consist of execution times
between successive failures. To give the reader a rough idea of the data, we
TABLE 1

Summary of Software Failure Data Sets

Data    Original source          Designation in     Designation in
set     and reference            original source    references 20,31

D1      Musa 32                  1                  T1
D2      Musa 32                  2                  T2
D3      Musa 32                  3                  T3
D4      Musa 32                  4                  T4
D5      Musa 32                  5                  T5
D6      Musa 32                  6                  T6
D7      Musa 32                  27                 T16
D8      Musa 32                  40                 T17
D9      Musa 34                  -                  T18
D10     Musa 32                  17                 T19
D11     Sukert 33                -                  T20
D12     Musa 32                  -                  T21
D13     Miller 35                ISEE-C             T22
D14     Miller 35                AEM                T23
D15     Miller 35                SMM                T25
D16     Abdalla-Ghaly et al. 7   Fig. 2             -
D17     Abdalla-Ghaly et al. 7   Fig. 3             -
D18     Mock 36                  A                  -
D19     Mock 36                  B                  -
D20     Mock 36                  C                  -


present them in an aggregated form in Table 2: we split the total cumulative
time for each data set into 10 equal intervals, and show the cumulative
number of failures occurring up to each of 10 elapsed time points. From
Table 2 it is possible to construct very rough plots of reliability growth as in
Fig. 1. The original raw unaggregated data are used to fit reliability growth
models.

The experiment is designed as follows. For each data set we select nine
time points, equal to $k/10$ of the total execution time for the entire data set,
$k = 1, 2, 3, \ldots, 9$; for each of our six simple models (M1-M6), we find the MLE
and make predictions. We have 180 (= 9 x 20) data intervals: $[0, (k/10) T_j^{tot}]$,
$k = 1, 2, 3, \ldots, 9$, $j = 1, 2, 3, \ldots, 20$, where $T_j^{tot}$ is the total execution time for
data set Dj. Using maximum likelihood estimation, we fit the model Mi to the
failure data observed from data set Dj in the interval $[0, (k/10) T_j^{tot}]$, then the
fitted model is used to predict the number of failures to occur in the future
interval $((k/10) T_j^{tot}, T_j^{tot}]$, for $k = 1, 2, 3, \ldots, 9$, $j = 1, 2, 3, \ldots, 20$ and
$i = 1, 2, 3, \ldots, 6$. Let $\hat{n}_i(j, k)$ equal the number of predicted failures for data set Dj
in the time interval $((k/10) T_j^{tot}, T_j^{tot}]$, predicted by model Mi fitted to data
observed from data set Dj over the time interval $[0, (k/10) T_j^{tot}]$. Let $n(j, k)$
TABLE 2

Cumulative Failures Occurring in Percentage of Total Time

Data                     Elapsed percentages of time
set     10%   20%   30%   40%   50%   60%   70%   80%   90%  100%

D1       49    74    85    93   104   114   122   128   132   136
D2       20    28    30    41    42    46    48    50    52    54
D3       22    24    28    30    30    33    35    35    36    38
D4       24    37    45    50    50    50    51    51    51    53
D5       53   103   153   172   192   235   251   264   273   288
D6       15    26    32    33    47    58    66    68    69    73
D7       15    23    25    28    31    33    39    40    40    41
D8       63    75    76    78    79    79    85    89    92   101
D9       74   103   123   137   146   146   152   155   158   163
D10       7    16    24    27    30    33    36    36    36    38
D11      28    50    54    68    79    87    93    97    99   100
D12      14    20    27    33    38    50    60    65    70    75
D13      23    38    61    62    73    80    98   102   110   117
D14      25    48    78    89    98   127   133   142   167   179
D15      44    69    95   106   129   137   162   170   185   210
D16      15    28    39    49    54    60    68    71    75    81
D17      36    74   100   117   145   158   175   189   198   207
D18      11    20    29    31    33    38    40    41    42    43
D19      21    27    32    33    34    35    36    37    38    40
D20       2     4     6    10    12    14    14    14    16    17


equal the observed number of failures for data set Dj in $((k/10) T_j^{tot}, T_j^{tot}]$. The
prediction errors are

    $e_i(j, k) = \hat{n}_i(j, k) - n(j, k)$

and the relative prediction errors are

    $r_i(j, k) = (\hat{n}_i(j, k) - n(j, k)) / n(j, k)$

    $i = 1, \ldots, 6$, $j = 1, \ldots, 20$ and $k = 1, \ldots, 9$

(For the 20 data sets there happens to always be at least one failure in the last
interval, so we avoid division by 0.) The prediction errors and the relative
prediction errors for model M3 are tabulated in Tables 3 and 4, respectively.
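
As a worked check of these definitions, the sketch below recomputes the relative errors for data set D1 under model M3. The cumulative counts are the D1 row of Table 2 and the prediction errors are the D1 row of Table 3, both as read from this copy; the function and variable names are hypothetical.

```python
# D1 row of Table 2: cumulative failures at 10%, 20%, ..., 100% of total time.
cumulative_d1 = [49, 74, 85, 93, 104, 114, 122, 128, 132, 136]

# D1 row of Table 3: prediction errors e_3(1, k) for k = 1, ..., 9.
errors_m3_d1 = [-22.8, 1.0, -8.7, -12.0, -6.4, -1.7, 0.8, 1.4, 0.4]


def observed_future_failures(cumulative):
    """n(j, k): failures observed after k/10 of the total time, k = 1, ..., 9."""
    total = cumulative[-1]
    return [total - c for c in cumulative[:-1]]


def relative_errors(predicted, observed):
    """r(j, k) = (predicted - observed) / observed for each k."""
    return [(p - n) / n for p, n in zip(predicted, observed)]


observed = observed_future_failures(cumulative_d1)
# Recover the predicted counts from the tabulated errors: predicted = observed + error.
predicted = [n + e for n, e in zip(observed, errors_m3_d1)]
rel = relative_errors(predicted, observed)
```

The resulting values agree with the D1 row of Table 4 to within the rounding of the tabulated errors.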

For each super model (M7-M14) we have a selection criterion. Based on
these criteria, each one of the super models chooses one of the simple models
and makes a prediction. For $i = 7, 8, 9, \ldots, 14$, $j = 1, 2, 3, \ldots, 20$ and
$k = 1, 2, 3, \ldots, 9$, let $c_i(j, k)$ equal the index of the simple model (1, 2, 3, 4, 5 or 6)
that super model Mi likes the best for data from data set Dj over the interval
$[0, (k/10) T_j^{tot}]$; for example, Table 5 shows the choices made by model M8,
which uses Kolmogorov-Smirnov goodness-of-fit as its selection criterion.
TABLE 3

Prediction Errors, $e_3(j, k)$, for Model M3

Data                        Percentage of time elapsed
set       10%      20%      30%      40%      50%      60%      70%      80%      90%

D1      -22.8      1.0     -8.7    -12.0     -6.4     -1.7      0.8      1.4      0.4
D2       -2.7     -4.7    -11.0      3.6     -1.3      0.5     -0.2     -0.4     -0.3
D3       19.9     -0.8      0.5     -0.2     -2.7     -0.7      0.1     -1.2     -1.1
D4       25.1     21.2     20.0     17.1      9.5      5.2      3.4      1.2     -0.6
D5      242.0    227.0    222.0     26.6      0.8     42.6     23.4     11.2      0.5
D6      -36.5    -16.3    -18.6    -27.1     -2.4     14.8     14.8      5.2      0.1
D7      109.0     10.8     -0.8     -1.6     -1.1     -1.5      3.8      2.4      0.5
D8       54.2     20.6      2.6     -3.9     -8.5    -12.7     -9.1     -7.6     -6.9
D9      -19.5     -1.1      8.7     12.7     12.2      2.6      2.7      0.4     -1.1
D10      32.0     42.0     42.0     16.0      8.8      6.8      6.2      2.3     -0.2
D11     179.0    103.4     -4.1      7.9     11.9     11.2      9.6      6.3      2.5
D12     -53.4    -45.3    -34.4    -27.5    -24.0     -2.9      8.1      2.5      0.7
D13     113.0    -10.8     60.4    -17.7    -11.7    -13.5      5.1     -1.6     -0.4
D14    -115.9    -61.1     71.6    -23.1    -36.5      6.5    -13.5    -17.4      0.8
D15     -17.4    -54.8    -28.2    -48.2    -26.0    -36.3    -14.1    -20.0    -14.7
D16      -0.7     38.1     13.2     13.6      0.7     -0.6      3.1     -0.9     -1.7
D17    -123.1     33.1      4.4    -14.1     16.4      3.9      7.0      6.8      2.2
D18      -9.3     10.7     27.6      6.3      1.8      5.2      3.8      1.9      0.8
D19       3.8      1.1      4.8      1.7      0.2     -0.6     -1.0     -1.1     -1.1
D20     -10.1     -8.1     -4.0      8.0      7.0      6.3      0.9     -1.1      0.1


Next, model Mi predicts that $\hat{n}_i(j, k) = \hat{n}_{c_i(j,k)}(j, k)$ failures will occur for data
set Dj over the time interval $((k/10) T_j^{tot}, T_j^{tot}]$. Errors are then computed as
before; for example, Table 6 shows the relative errors for model M8.

We wish to compute some summary statistics of how the different models
perform over the different data sets, Dj, $j = 1, 2, 3, \ldots, 20$. There does not seem
to be any obvious best way to summarize the performance. It seems that
relative error is preferred to absolute error in order to prevent one or two
data sets with large numbers of failures dominating the summarizing
statistics. Also, initially we want to summarize over independent test cases.
Therefore we consider average relative errors (averaged over the 20 data
sets) for each of the nine elapsed time percentiles. We consider two averages:

The average bias:

    $b_i(k) = \sum_{j=1}^{20} r_i(j, k) / 20$

The average deviation:
TABLE 4

Relative Prediction Errors, $r_3(j, k)$, for Model M3

Data                        Percentage of time elapsed
set       10%      20%      30%      40%      50%      60%      70%      80%      90%

D1      -0.26     0.02    -0.17    -0.28    -0.20    -0.08     0.06     0.17     0.09
D2      -0.08    -0.18    -0.46     0.27    -0.11     0.06    -0.03    -0.10    -0.14
D3       1.24    -0.05     0.05    -0.02    -0.34    -0.15     0.03    -0.39    -0.57
D4       0.87     1.32     2.50     5.71     3.17     1.72     1.70     0.58    -0.29
D5       1.03     1.23     1.64     0.23     0.01     0.80     0.63     0.47     0.03
D6      -0.63    -0.35    -0.45    -0.68    -0.09     0.98     2.11     1.05     0.01
D7       4.19     0.60    -0.05    -0.12    -0.11    -0.19     1.91     2.44     0.49
D8       1.43     0.79     0.10    -0.17    -0.39    -0.58    -0.57    -0.63    -0.77
D9      -0.22    -0.02     0.22     0.49     0.72     0.15     0.25     0.05    -0.22
D10      1.03     1.91     3.00     1.45     1.10     1.36     3.09     1.14    -0.10
D11      2.49     2.07    -0.09     0.25     0.57     0.86     1.32     2.10     2.53
D12     -0.87    -0.82    -0.72    -0.65    -0.65    -0.12     0.54     0.25     0.14
D13      1.20    -0.14     1.08    -0.32    -0.27    -0.36     0.27    -0.11    -0.06
D14     -0.75    -0.48     0.71    -0.26    -0.45     0.13    -0.29    -0.47     0.07
D15     -0.10    -0.39    -0.25    -0.46    -0.32    -0.50    -0.29    -0.50    -0.59
D16     -0.01     0.72     0.31     0.43     0.03    -0.03     0.24    -0.08    -0.29
D17     -0.72     0.25     0.04    -0.16     0.26     0.08     0.22     0.38     0.25
D18     -0.29     0.47     1.97     0.52     0.18     1.05     1.26     0.96     0.76
D19      0.20     0.21     0.60     0.24     0.03    -0.11    -0.24    -0.37    -0.55
D20     -0.67    -0.63    -0.36     1.14     1.40     2.11     0.29    -0.36     0.08


These two average performance measures are tabulated in Tables 7 and 8,
respectively. Finally, succumbing to the temptation to try to quantify the
overall behavior of each model, we compute two grand averages:

Grand average bias:

$$\bar{b}_i = \sum_{k=1}^{9} b_i(k) \,/\, 9$$

Grand average deviation:

$$\bar{d}_i = \sum_{k=1}^{9} d_i(k) \,/\, 9$$

These grand averages are tabulated in Table 9 for the 14 models.
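
The summary statistics above can be sketched in a few lines of Python. This is only an illustration: it assumes that the average deviation is the mean absolute relative error (which is consistent with the tabulated values), and the function name `summarize` and the small error matrix are hypothetical, not taken from the experiment.

```python
from statistics import mean

def summarize(rel_errors):
    """rel_errors[j][k] is the relative error for data set j at time point k.

    Returns per-time-point average bias and deviation, plus the two
    grand averages taken over the time points.
    """
    n_points = len(rel_errors[0])
    # Average bias: signed errors may cancel across data sets.
    bias = [mean(row[k] for row in rel_errors) for k in range(n_points)]
    # Average deviation (assumed): mean of absolute errors, no cancellation.
    deviation = [mean(abs(row[k]) for row in rel_errors) for k in range(n_points)]
    return bias, deviation, mean(bias), mean(deviation)

# Made-up relative errors: three data sets, three elapsed-time points.
errors = [
    [0.50, -0.25, 0.00],
    [-0.50, 0.25, 0.50],
    [1.50, 0.75, -0.50],
]
bias, deviation, grand_bias, grand_deviation = summarize(errors)
print(bias, deviation, grand_bias, grand_deviation)
```

A model can show a small bias and a large deviation at the same time, which is why the two measures are reported side by side.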

There are many other ways to summarize data. Musa and co-workers 20,31
used a normalized error (dividing the prediction error by the total number of
failures in the data set); they then summarized by considering the median
normalized error at each elapsed time point. The interested reader has
probably already thought of other variations. One of the open problems in
this research area is to identify the best ways to define and evaluate
performance statistics.


TABLE 5

Simple Models Chosen by Super-Model M8

Data               Percentage of time elapsed
set     10%    20%    30%    40%    50%    60%    70%    80%    90%

D1      M1     M1     M6     M6     M6     M6     M6     M6     M6
D2      M6     M2     M2     M1     M5     M5     M5     M5     M5
D3      M4     M6     M6     M6     M6     M3     M3     M3     M3
D4      M2     M6     M2     M2     M2     M6     M2     M2     M6
D5      M6     M6     M1     M2     M2     M2     M2     M2     M2
D6      M6     M5     M5     M3     M2     M2     M2     M1     M5
D7      M6     M6     M6     M4     M4     M4     M3     M3     M3
D8      M4     M2     M2     M2     M2     M4     M4     M4     M3
D9      M2     M3     M3     M3     M3     M3     M3     M3     M3
D10     M1     M6     M1     M6     M6     M4     M4     M6     M6
D11     M1     M6     M6     M6     M6     M1     M3     M3     M5
D12     M5     M5     M1     M1     M1     M4     M3     M2     M2
D13     M1     M2     M1     M6     M6     M6     M3     M4     M4
D14     M6     M3     M2     M3     M6     M1     M3     M3     M5
D15     M6     M2     M5     M5     M5     M5     M5     M5     M5
D16     M6     M5     M2     M3     M6     M6     M3     M2     M2
D17     M5     M1     M1     M2     M3     M3     M3     M3     M3
D18     M6     M5     M5     M2     M6     M2     M2     M2     M6
D19     M6     M4     M3     M6     M6     M6     M6     M6     M6
D20     M6     M2     M3     M1     M4     M4     M2     M6     M6


INTERPRETATION OF EXPERIMENTS 

There is a great temptation to end such an experimental investigation with
the conclusion: 'The winner is model ...'. That would be very misleading for
this type of experiment. The experiment is based on only 20 data sets. The
statistical estimation methods, performance measures and general design of
the experiment are quite arbitrary; other choices could be made. It is hoped
that this experiment gives rough ideas of how reliability growth models
perform, what can be expected of them, and of numerous open questions
arising about their usage. Nevertheless, there are several observations that
can be made from this experiment.

A cursory glance at the size of the errors in Tables 3, 4, 6, 7, 8 and 9 leads us
to the conclusion that the predictions have some value. They are not
extremely accurate, but they are good enough to be helpful in some
situations. It appears that reliability growth estimates may be useful for
forecasting future maintenance activities on moderately reliable software.
(For a successful application of software reliability growth modeling to
certifying software, see Currit et al. 37)


TABLE 6

Relative Prediction Errors, r(j,k), for Model M8

Data               Percentage of time elapsed
set     10%    20%    30%    40%    50%    60%    70%    80%    90%

D1     0.91   1.09  -0.44  -0.62  -0.32  -0.03   0.12   0.20   0.64
D2    -0.76  -0.86  -0.96   1.43   0.06   0.34   0.15   0.04  -0.02
D3     1.24  -0.95  -0.74  -0.91  -0.91  -0.15   0.03  -0.39  -0.57
D4     0.07   0.95   0.23   0.87  -0.43  -0.81  -0.82  -0.93  -0.98
D5     7.38  -0.64   1.64  -1.00  -3.00   0.55   0.28   0.07  -0.50
D6    -0.99   0.22  -0.05  -0.68   0.88   1.16   2.26   1.43   0.28
D7     6.72  -0.99  -0.97  -0.89  -0.93  -0.96   1.91   2.44   0.49
D8    -1.00  -0.89  -1.00  -0.25  -0.45  -0.99  -1.00  -1.00  -1.00
D9    -0.96  -0.02   0.22   0.49   0.72   0.15   0.25   0.05  -0.22
D10    1.03  -0.92   3.00  -0.76  -0.72   1.32   3.00   0.66  -0.92
D11    2.49  -0.85  -1.02  -0.74  -0.30   2.04   1.32   2.10   3.67
D12   -0.63  -0.62  -0.50  -0.48  -0.51  -0.83   0.54   0.28   0.14
D13    1.20  -0.50   1.36  -0.93  -0.64  -0.65   0.27  -0.11  -0.06
D14   -0.96  -0.48   0.72  -0.26  -0.65   0.29  -0.29  -0.47   0.22
D15   -0.96  -0.78  -0.25  -0.44  -0.21  -0.48  -0.16  -0.44  -0.53
D16   -0.77   0.85   0.12   0.43  -0.37  -0.32   0.24  -0.08  -0.29
D17   -0.59   0.62   0.54  -0.48   0.26   0.08   0.22   0.38   0.25
D18   -0.81   1.30   2.66  -0.20  -0.60   0.26   0.27  -0.05  -0.14
D19   -0.65   0.21   0.53  -0.62  -0.75  -0.79  -0.79  -0.79  -0.82
D20   -1.00  -0.93  -0.36   1.14   1.40   1.76   0.05  -0.89  -0.38


TABLE 7

Average Relative Bias of Predictions, b_i(k)

                   Percentage of time elapsed
Model   10%    20%    30%    40%    50%    60%    70%    80%

M1     1.56   1.24   1.25   1.32   0.97   0.99   1.38   0.99
M2    -0.03  -0.23  -0.06  -0.35  -0.39  -0.15   0.05  -0.23
M3     0.45   0.33   0.48   0.38   0.23   0.36   0.62   0.33
M4     0.30  -0.06  -0.01  -0.03  -0.18  -0.29  -0.20  -0.43
M5     1.05   0.84   0.91   0.94   0.47   0.58   0.92   0.65
M6     1.12  -0.15  -0.12  -0.24  -0.44  -0.18  -0.14  -0.08
M7     0.61   0.21   0.33   0.28   0.09   0.24   0.60   0.28
M8     0.55  -0.21   0.24  -0.24  -0.31   0.10   0.39   0.06
M9     0.25   0.24  -0.04   0.0?  -0.07   0.19   0.39   0.38
M10    1.56   0.64   0.65   0.11   0.06   0.35   0.76   0.49
M11    0.82   0.49   0.77   0.29   0.10   0.27   0.47   0.50
M12    0.25  -0.12   0.08   0.17  -0.02  -0.01   0.23   0.27
M13    1.56   0.46   0.77   0.13  -0.13   0.19   0.53   0.01
M14    0.82   0.11   0.32  -0.17  -0.34   0.15   0.27   0.01


TABLE 8

Average Relative Deviations of Predictions, d_i(k)

                   Percentage of time elapsed
Model   10%    20%    30%    40%    50%    60%    70%    80%    90%

M1     1.66   1.31   1.30   1.41   1.04   1.06   1.41   1.09   0.70
M2     1.03   0.81   0.91   0.67   0.59   0.66   0.66   0.57   0.47
M3     0.91   0.63   0.74   0.69   0.52   0.57   0.77   0.63   0.40
M4     1.03   0.87   0.91   0.90   0.72   0.74   0.79   0.65   0.60
M5     1.39   1.03   1.06   1.12   0.63   0.71   1.00   0.89   0.59
M6     2.25   0.94   0.82   0.84   0.64   0.67   0.56   0.89   1.26
M7     1.13   0.90   0.89   0.80   0.76   0.94   1.26   0.89   0.79
M8     1.56   0.73   0.87   0.68   0.57   0.70   0.70   0.64   0.57
M9     1.22   0.87   0.77   0.66   0.57   0.74   0.78   0.85   1.09
M10    1.66   1.05   1.16   0.61   0.54   0.78   1.05   0.91   0.76
M11    1.48   0.87   1.16   0.63   0.39   0.64   0.77   0.90   1.07
M12    1.22   0.76   0.97   0.91   0.75   0.67   0.72   0.82   0.49
M13    1.66   0.83   1.29   0.75   0.44   0.72   0.83   0.60   0.60
M14    1.48   0.72   1.03   0.65   0.57   0.62   0.75   0.62   0.43

This experiment is strongly conditional on the data sets used. If one of the 
simple models (M1-M6) was truly a superior fitting model, we would expect 
that model to perform best, and we would expect super models (M7-M14) 
with good selection criteria to consistently choose that simple model. The 
fact that this is not happening suggests that none of the simple models is a 
superior fit. Furthermore, the super model approach is not an improvement. 
This needs further study, perhaps in a more controlled experiment based on 
Monte Carlo data. 

There is an interesting trade-off between the two-parameter simple models
(M1, M2 and M3) and the three-parameter models (M4, M5 and M6). The
three-parameter models should generally fit better because they have richer
flexibility. However, it is more difficult to fit a higher-parameter family,
especially with a general-purpose search algorithm like Nelder-Mead; so
we may not always be getting the best-fitting member of a three-parameter
family. Thus the three-parameter simple models may not be performing as
well as they might with a perfect search algorithm.


TABLE 9

Summary Performance Measures for 14 Models

Model   Grand average bias   Grand average deviation

M1            1.14                   1.22
M2         [illegible]               0.71
M3            0.36                   0.65
M4           -0.16                   0.80
M5            0.74                   0.93
M6            0.01                   0.98
M7            0.30                   0.93
M8            0.06                   0.78
M9            0.22                   0.83
M10           0.54                   0.96
M11           0.48                   0.88
M12           0.07                   0.81
M13           0.39                   0.86
M14           0.10                   0.76

There is also a trade-off between 'goodness-of-fit' and 'quality-of-
prediction'. A model that fits the observed data well is not guaranteed to give
good predictions into the future, especially if it is a rich family parameterized
with several parameters. (Akaike 29 suggests handicapping parametric
families based on the number of parameters.) The idea of an integrated
'prediction system' 7 suggests that 'quality-of-prediction' measures should be
used for fitting models. However, in our experiment there is not an obvious
difference between the super models based on the two different concepts.
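
One widely used handicap of this kind is the Akaike information criterion, AIC = 2p - 2 ln L, where p is the number of fitted parameters and L is the maximized likelihood. The sketch below is only an illustration of how such a penalty works; the log-likelihood values are made up, and nothing here should be read as the specific criterion of reference 29 or as the selection rule used in the experiment.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: smaller values are preferred."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical maximized log-likelihoods for one data set.
two_param = aic(log_likelihood=-120.0, n_params=2)    # 244.0
three_param = aic(log_likelihood=-119.5, n_params=3)  # 245.0

# The three-parameter family fits slightly better, but not by enough
# to pay for its extra parameter.
preferred = "two-parameter" if two_param < three_param else "three-parameter"
print(preferred)  # two-parameter
```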

There is an effect caused by how the prediction errors are measured. A
model may overestimate the number of future errors by any amount, but it
may underestimate the number by at most 100%. This means that a few wild
overestimates will hurt the average performance (Tables 7, 8 and 9) much
more than wild underestimates (which are probably the more serious error).
This may be making model M2 (which tends to underestimate) appear better
than it really is and model M1 (which tends to overestimate) appear worse
than it really is. A more reasonable error summary might weight an
'underestimate by one-half' as equivalent to an 'overestimate by two-fold', for
example. There are other possibilities; in fact, how to evaluate and
summarize performance is an open question.
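
One way to obtain the symmetry described above is to measure error on a logarithmic scale. The following sketch compares the relative error used in the experiment with a log-ratio error; the log-ratio measure and the function names are assumptions chosen for illustration, not something proposed in the paper, and the measure is undefined when a model predicts zero failures.

```python
import math

def relative_error(predicted, actual):
    """Relative error as used in the tables: bounded below by -1."""
    return (predicted - actual) / actual

def log_ratio_error(predicted, actual):
    """Symmetric alternative (assumed): halving and doubling give
    errors of equal size and opposite sign. Requires predicted > 0."""
    return math.log(predicted / actual)

actual = 10
for predicted in (5, 20):
    print(predicted,
          relative_error(predicted, actual),
          round(log_ratio_error(predicted, actual), 3))
```

With 10 actual failures, predictions of 5 and 20 have relative errors of -0.5 and 1.0, but log-ratio errors of equal magnitude.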

Model M3 seems to be doing slightly better than all others. This model
also performed the best in Musa & Okumoto's 20 performance studies using
different summary statistics (median normalized prediction errors).
Nagel 38,39 and others have observed a log-linear pattern among occurrence
rates of bugs in a program and hypothesize that this may be a frequently
occurring pattern; Miller 18 has shown that this pattern is approximately
modeled by model M3. Another interesting fact is that model M3 plays a
central role among the models M1-M5; see Miller 18. So it is not surprising
that model M3 scores best in Table 9. But, of course, it is all conditional on
the 20 data sets.

Phillips (see Adams 40) has observed that the occurrence rates for bugs in
some large operating systems show a power law pattern which is equivalent
to model M1 (see Miller 18), but model M1 does not perform well for our 20
data sets. For different sets of data the performance of M1 and M3 might be
reversed. This is why we want the freedom to pick the best model for each
piece of software. These experiments imply either that this is impossible or
that we have not figured out how to do it yet.
Looking at the prediction errors at the 90% elapsed time point in Tables 3,
4 and 6 reveals moderately sized errors. This should be a fairly easy
prediction problem: we are predicting for a future time interval equal in
length to 1/9 of the previous observed interval. From these moderately sized
errors we conclude that it is not reasonable to ask these reliability growth
models to accurately predict that software will perform error free for long
future time intervals.



NON-APPLICABILITY TO SAFETY-CRITICAL SOFTWARE 

Safety-critical software must be extremely reliable. The question is how to
achieve extremely high levels of reliability. The reliability growth scenario
would start with faulty software. Through execution of the software, bugs
are discovered. The software is then modified to correct for the design flaws
represented by the bugs. Gradually the software evolves into a state of
higher reliability. There are at least two general reasons why this is an
unreasonable approach to highly reliable safety-critical software. First, the
time required for reliability to grow to acceptable levels will tend to be
extremely long. Second, extremely high levels of reliability cannot be
statistically guaranteed.

For a discussion of the limitations of the statistical approach to high
reliability, see Miller 4,5. For a good discussion about the reliability growth
scenario, see Gray 16,17. Gray points out many aspects of reliability growth,
some of which are difficult to quantify and thus ignored by the usual
reliability growth models; ignoring these aspects may not lead to
unacceptable results when dealing with nominal levels of reliability, but they
cannot be ignored when dealing with extremely high levels of reliability. See
Hamlet 41,42 for discussion of some additional complications.


CONCLUSIONS 

We conclude that reliability growth models are useful for predicting the
number of failures over finite future time intervals when we are dealing with
software which is of low or moderate reliability. Maintenance of large
moderately reliable software systems might be usefully predicted by these
models.
There are numerous open questions about software reliability growth
models. When making predictions into the future, it is very important to use
a good model; how to choose the best model is an open question. The
quantification of prediction errors (by confidence intervals or other
methods) is yet to be solved. The best ways to evaluate performance of
models have not been identified.

Our experiment shows that an apparently reasonable way to improve 
reliability growth modeling prediction based on super models results in no 
improvement. This may be due to the particular data sets we used or to other 
factors mentioned in the paper. A controlled Monte Carlo study may be 
useful in answering these questions. Regardless, the experiment reveals some 
of the problems arising in reliability growth modeling. 

Through this experiment and the errors calculated, we have tried to 
convey a rough idea of how well software reliability growth models perform. 


ACKNOWLEDGEMENTS 

P. A. Keiller thanks Fred Waters for many helpful discussions. D. R. Miller
gratefully acknowledges research support from the National Aeronautics 
and Space Administration, Grant NAG 1-771. 


REFERENCES 

1. Eckhardt, D. E. & Lee, L. D., A theoretical basis for the analysis of multiversion
software subject to coincident errors. IEEE Transactions on Software
Engineering, SE-12 (1985) 1511-16.

2. Littlewood, B., A reliability model for systems with Markov structure. Applied
Statistics, 24 (1975) 172-7.

3. Littlewood, B. & Miller, D. R., Conceptual modelling of coincident failures in
multiversion software. IEEE Transactions on Software Engineering (in press).

3a. Iannino, A., Musa, J. D., Okumoto, K. & Littlewood, B., Criteria for software
reliability model comparisons. ACM Sigsoft Software Engineering Notes, 8(3)
(1983) 12-16.

4. Miller, D. R., Making statistical inferences about software reliability. 1986 Joint
Statistical Meetings Invited Paper, Chicago, August 1986. Available as CR-
4197, National Aeronautics and Space Administration, December 1988.

5. Miller, D. R., The role of statistical modeling and inference in software quality
assurance. In Software Certification, ed. Bernard de Neuman. Elsevier Applied
Science, London, 1989, pp. 135-52.

6. Jelinski, Z. & Moranda, P. B., Software reliability research. In Statistical
Computer Performance Evaluation, ed. W. Freiberger. Academic Press, New
York, 1972, pp. 465-84.

7. Abdalla-Ghaly, A. A., Chan, P. Y. & Littlewood, B., Evaluation of competing
reliability predictions. IEEE Transactions on Software Engineering, SE-12
(1986) 950-67.

8. Dale, C. J., Software reliability evaluation methods. ST-26750, British
Aerospace, September 1982.

9. Dale, C. J., Software reliability models. In Software Reliability: State of the Art
Report 14:2, ed. A. Bendell & P. Mellor. Pergamon Infotech, London, 1986,
pp. 31-44.

10. Farr, W. H., A survey of software reliability modeling and estimation. AD-
A154.874, Naval Surface Weapons Center, Dahlgren, VA, 1983.

11. Goel, A. L., Software reliability modelling and estimation techniques. RADC-
TR-82-263, Rome Air Development Center, Griffiss Air Force Base, New York,
1982.

12. Littlewood, B. & Verrall, J. L., A Bayesian reliability growth model for
computer software. Journal of the Royal Statistical Society, Series C (Applied
Statistics), 22 (1973) 332-46.

13. Littlewood, B., Stochastic reliability-growth: a model for fault-removal in
computer-programs and hardware-designs. IEEE Transactions on Reliability,
R-30 (1981) 313-20.

14. Duane, J. T., Learning curve approach to reliability monitoring. IEEE
Transactions on Aerospace, AS-2 (1964) 563-6.

15. Crow, L. H., Reliability analysis for complex repairable systems. In Reliability
and Biometry: Statistical Analysis of Lifelength, ed. F. Proschan & R. J. Serfling.
SIAM, Philadelphia, PA, 1974, pp. 379-410.

16. Gray, C. T., Superposition models for reliability growth. PhD thesis, University
of Birmingham, 1985.

17. Gray, C. T., A framework for modelling software reliability. In Software
Reliability: State of the Art Report 14:2, ed. A. Bendell & P. Mellor. Pergamon
Infotech, London, 1986, pp. 81-94.

18. Miller, D. R., Exponential order statistic models of software reliability growth.
CR-3909, National Aeronautics and Space Administration, July 1985.
(Abridged version: IEEE Transactions on Software Engineering, SE-12 (1986)
12-24.)

19. Miller, D. R. & Sofer, A., A nonparametric approach to software reliability
using complete monotonicity. In Software Reliability: State of the Art Report
14:2, ed. A. Bendell & P. Mellor. Pergamon Infotech, London, 1986, pp. 183-95.

20. Musa, J. D. & Okumoto, K., A logarithmic Poisson execution time model for
software reliability measurement. Proceedings of the 7th International
Conference on Software Engineering. IEEE Computer Society Press,
Washington, DC, 1984, pp. 230-8.

21. Musa, J. D. & Okumoto, K., A comparison of time domains for software
reliability models. Journal of Systems and Software, 4 (1984) 277-87.

22. Goel, A. L. & Okumoto, K., Time-dependent error-detection rate model for
software reliability and other performance measures. IEEE Transactions on
Reliability, R-28 (1979) 206-11.

23. Okumoto, K., A statistical method for software quality control. IEEE
Transactions on Software Engineering, SE-11 (1985) 1424-30.

24. Nelder, J. A. & Mead, R., A simplex method for function minimization.
Computer Journal, 7 (1965) 308-13.

25. Chan, P. Y., Software reliability prediction. PhD thesis, City University,
London, 1986.

26. Brocklehurst, S., Private communication, 1988.

27. Keiller, P. A., Littlewood, B., Miller, D. R. & Sofer, A., Comparison of software
reliability predictions. 13th International Symposium on Fault-Tolerant
Computing, Digest of Papers. IEEE Computer Society Press, Washington, 1983,
pp. 128-34.

28. Dawid, A. P., Statistical theory: the prequential approach. Journal of the Royal
Statistical Society, A, 147 (1984) 278-92.

29. Akaike, H., Prediction and Entropy. Mathematics Research Center, University
of Wisconsin, Madison, Wisconsin, USA, June 1982.

30. Littlewood, B. & Keiller, P. A., Adaptive software reliability modelling. 14th
International Symposium on Fault-Tolerant Computing, Digest of Papers. IEEE
Computer Society Press, Washington, 1984, pp. 108-13.

30a. Chan, P. Y., Adaptive models. In Software Reliability: State of the Art Report
14:2, ed. A. Bendell & P. Mellor. Pergamon Infotech, London, 1986, pp. 3-18.

31. Musa, J. D., Iannino, A. & Okumoto, K., Software Reliability: Measurement,
Prediction, Application. McGraw-Hill, New York, 1987.

32. Musa, J. D., Software reliability data. Data and Analysis Center for Software,
Rome Air Development Center, Rome, New York, 1979.

33. Sukert, A. N., A software reliability modeling study. Rome Air Development
Center, Technical Report RADC-TR-76-247, Rome, New York, 1976.

34. Musa, J. D., Private communication, 1988.

35. Miller, A. M. B., A study of the Musa reliability model. MS thesis, University of
Maryland, College Park, MD, 1980.

36. Moek, G., Comparison of some software reliability models for simulated and
real failure data. 4th IASTED (International Association of Science and
Technology for Development) International Symposium and Course 'Modelling
and Simulation', Lugano, Italy, 21-24 June 1983.

37. Currit, P. A., Dyer, M. & Mills, H. D., Certifying the reliability of software. IEEE
Transactions on Software Engineering, SE-12 (1986) 3-11.

38. Nagel, P. M. & Skrivan, J. A., Software reliability: repetitive run experimentation
and modeling. NASA CR-165836, 1982.

39. Nagel, P. M., Scholz, F. W. & Skrivan, J. A., Software reliability: additional
investigations into modeling with replicated experiments. NASA CR-172378,
1984.

40. Adams, E. N., Optimizing preventive service of software products. IBM Journal
of Research and Development, 28 (1984) 2-14.

41. Hamlet, R. G., Probable correctness theory. Information Processing Letters, 25
(1987) 17-25.

42. Hamlet, D. & Taylor, R., Partition testing does not inspire confidence.
Proceedings of Second Workshop on Software Testing, Verification, and Analysis.
IEEE Computer Society Press, Washington, DC, 1988, pp. 206-15.




Appendix 3 


M. Lyu, H. Hecht, H. Kopetz, D. Miller, J. Musa, M. Ohba, and D.
Siefert, "Research and Development Issues in Software Reliability
Engineering," Proceedings of the IEEE International Symposium on
Software Reliability Engineering (1991): 80-89.

(Reprinted in Software Engineering Notes 16,2 (1991): 23-30.) 


PANEL: RESEARCH AND DEVELOPMENT ISSUES 
IN SOFTWARE RELIABILITY ENGINEERING 


Panel Chair: Michael Lyu (University of Iowa) 


Panelists: Herbert Hecht (SoHaR Inc.)
           Hermann Kopetz (Technical University of Vienna)
           Douglas Miller (George Mason University)
           John Musa (AT&T Bell Labs.)
           Mits Ohba (IBM Corporation)
           David Siefert (NCR)


Introduction 

Michael R. Lyu, University of Iowa 

Computers are bringing revolutionary changes to our life through their
involvement in most human-made systems for sensing, communication,
control, guidance and decision-making. As the functionality of computer
operations becomes more essential and complicated in modern society, the
reliability of computer software becomes more important and critical.

Research activities in software reliability engineering have been vigorous
in the past 20 years. Numerous statistical models have been proposed in the
literature for the prediction and estimation of software reliability, and many
research efforts and paradigms have been pursued for the design and
engineering of reliable software. However, there seems to be a gap between
the achievements of software reliability research and the results from
software reliability practice. We keep on hearing of troublesome software
projects, horrible software failures, and misconceptions in software
reliability applications.

It is the purpose of this panel to bring together researchers and
practitioners of this field to discuss the software reliability problems which
will have tremendous impact on our daily life. The panel is expected to raise
research and development issues under this concern, to address existing and
potential problems, to resolve some misunderstandings and conflicts, and to
reach a fundamental basis for the advancement of this field.

The panelists are invited to discuss topics including, but not limited to,
the following:

(1) What are the most urgent needs for software reliability practitioners?

(2) What kinds of issues would practitioners like researchers to pursue?

(3) Did practitioners get satisfactory results from software reliability
researchers?

(4) What are the most challenging software reliability issues researchers
are facing today?

(5) Did researchers gain enough support to perform software reliability
research?

(6) What kinds of input or feedback are researchers seeking from
practitioners?

(7) What practices should be developed and conducted based on the current
research results?

(8) What is the gap between software reliability modelers and measurers?
How can it be narrowed?

(9) What kinds of multi-institutional efforts have been, or should be,
conducted for acquiring software reliability standards, handbooks,
benchmarks, databases, tools, etc.?

The following sections consist of the position statements written by each
panelist under the panel title and the suggested topics.
Quantitative and Qualitative Concepts
Herbert Hecht, SoHaR

For Project Managers the reliability of the computing function as a whole
is of primary concern, and for that purpose a combined quantitative
hardware/software reliability expression is required. The responsibility for
hardware and software functions is frequently separated immediately below
the project management level, and therefore the project manager also needs
separate models for allocating and controlling the achievement of adequate
reliability. For these purposes broad statistical reliability metrics are
suitable, particularly failures per unit time of computer usage or time unit
loss of computer availability due to failures. Examples: failures per CPU-hr
or outage-hrs per month.

The software manager is responsible for achieving the statistical
reliability goals, but in order to know where and how to improve the
reliability, more specific measurements are required. Quantitative approaches
have so far been only of limited use in this domain. Audits, employment of
software development and test tools, and test planning are largely guided by
purely qualitative considerations. Therefore there exists at present no
consistent methodology that permits the software manager to meet the
quantitative requirements imposed by systems considerations with the tools
at their disposal.

Two activities can bring about a connection between the quantitative and
qualitative approaches, and can provide sorely needed advances toward
achieving more reliable software. The first activity is the quantitative
analysis of failures in terms of software development and test techniques
that could have prevented them. The resulting data, particularly if they are
weighted by severity of the failure, can provide the software manager with
concrete information on the means of improving the reliability of his/her
product.

The second step deals with the use of quantitative data as a test
termination criterion. The present practice of ending test on the basis of
schedule, budget, or (in the very best cases) attainment of a period of
failure-free operation, provides little useful feedback to the team that
developed the software or for the test planning in other projects.
Reliability growth measurement during formal test will permit termination on
demonstration of a defined reliability level and will also provide insights
into the effectiveness of different development and test methodologies.
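
As a rough illustration of a quantitative termination criterion, the sketch below stops testing once the failure rate observed over a recent window falls to a target level. This moving-window estimate is a deliberately naive stand-in chosen for brevity; it is not a reliability growth model and not the panelist's method, and the function names, window length and target rate are all hypothetical. Note also that a window with no failures does not demonstrate a true rate of zero.

```python
def recent_failure_rate(failure_times, now, window):
    """Failures per unit time observed in the interval (now - window, now]."""
    recent = [t for t in failure_times if now - window < t <= now]
    return len(recent) / window

def may_stop_testing(failure_times, now, window, target_rate):
    """True once a full window has elapsed and the observed rate is at
    or below the target; False otherwise."""
    if now < window:
        return False  # not enough test time to fill one window
    return recent_failure_rate(failure_times, now, window) <= target_rate

# Hypothetical failure times (hours of test) showing reliability growth.
failures = [1, 2, 4, 8, 15, 40]
print(may_stop_testing(failures, now=60, window=50, target_rate=0.02))   # False
print(may_stop_testing(failures, now=100, window=50, target_rate=0.02))  # True
```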

I will present examples of these integrated practices. 


Reliability of Real Time Systems 
Hermann Kopetz, Technical University of Vienna 

Since my background is in the area of fault-tolerant distributed real-time
systems, my view is determined from this position.

In hard real-time systems, i.e., systems where a failure can have
catastrophic consequences, a result must be correct, both in the domains of
value and time. Since the behavior in the domain of time depends on the
properties of the underlying hardware, an integrated software/hardware view
has to be taken. The functional correctness of the software per se (i.e.,
correctness in the value domain) is not sufficient.

Many failures of real-time systems are related to synchronization and
performance errors which manifest themselves as 'transient' system failures.
In the failure statistics of a complex real-time system [Gebman 1988], it is
recorded that less than 10% of the failures observed in the operation of the
system can be reproduced within the sophisticated test environment. Similar
results have been reported by other manufacturers of real-time systems. This
implies that we do not fully understand the character and the interactions of
the execution sequences which unfold over time in complex real-time systems
and do not know how to build effective test procedures.

This problem has to be attacked from the perspective of design. We have to
build real-time architectures that are easier to reason about. Most
present-day real-time systems are event triggered, i.e., as soon as an event
occurs, the computer system decides whether to process the task associated
with this event immediately or to delay processing until some time later.
These dynamic scheduling decisions can take a significant amount of
processing time, which is then not available for the application software.
Every different order of the events can give rise to a different scheduling
decision and thus to a different execution sequence. The potential input
space of event-triggered systems is enormous. It is difficult to reproduce an
input scenario because the exact timing of input cases cannot be controlled
easily. There are no methods known which can be applied to reason formally
about the timing behavior (i.e. the performance) of complex real-time systems.

If we introduce a time granularity in the system
operation by looking at the events only at predefined points in
the time domain (i.e., a time-triggered architecture), the
plurality of input cases can be substantially reduced.
Furthermore, static scheduling strategies become
feasible. The system structure will be more regular, i.e.,
more predictable and easier to understand and test. The
price paid for this reduction in complexity is a reduced
flexibility.
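
The reduction in input cases can be illustrated with a small
counting sketch. It assumes a deliberately simplified model that
is not part of the argument above: k distinct events may or may
not occur within one observation period, an event-triggered
system distinguishes every arrival order, and a time-triggered
system sees only the set of events pending at the next
predefined tick. The function names are hypothetical.

```python
from itertools import combinations, permutations


def event_triggered_cases(events):
    """Count input cases when every arrival order is distinguishable.

    Assumption: any subset of the events may occur in the period, and
    each ordering of that subset is a different input case.
    """
    count = 0
    for size in range(len(events) + 1):
        for subset in combinations(events, size):
            count += sum(1 for _ in permutations(subset))
    return count


def time_triggered_cases(events):
    """Count input cases when only the set of pending events is observed.

    Assumption: the system samples at a predefined tick, so the order
    of arrival inside the period is invisible to it.
    """
    return 2 ** len(events)


if __name__ == "__main__":
    sensors = ["a", "b", "c"]  # hypothetical event sources
    print(event_triggered_cases(sensors))  # 16 ordered cases
    print(time_triggered_cases(sensors))   # 8 unordered cases
```

Under this model the gap widens quickly as more event sources
are added, because the ordered count grows factorially while
the sampled count grows only exponentially.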

We feel that in the field of real-time systems every effort 
must be made to make the system clear and understand- 
able. In our research on distributed real-time systems 
[Kopetz 1989] this has always been our primary goal. 
We have found that time-triggered real-time software is 
inherently easier to understand and test than event- 
triggered software. Further research efforts in this area 
seem to be well justified. 

Statistical Issues in Software Reliability Engineering Research and Development

Douglas R. Miller, George Mason University

There are two major issues concerning software
reliability: achievement and assurance. Both are very
important. Obviously, software in critical applications
must achieve high reliability in order for the system to
function safely. But it is also necessary to have strong
"a priori" assurance that the software is highly reliable
before it can be put into use. For example, without
reasonable assurance that high reliability has been
achieved, flight-critical avionics software in commercial
aircraft should not be certified for public use.

So, the central focus of Software Reliability Engineering 
R&D is methodologies for achieving and assuring 
required levels of software reliability. The goal is reli- 
able software. How do you do it? How do you know 
when you’ve done it? Furthermore, what are the most 
efficient ways to achieve and assure the reliability? 

A central idea concerning reliability is "uncertainty." A 
given piece of software may or may not contain design 
flaws which will manifest themselves as system failures 
when the software is used at some time in the future. 
The point is that uncertainty is inherent to this 
phenomenon: we do not know if failures will happen 
and, if they do, when they will happen. To deal with 
this uncertainty, a scientific approach should be taken. 
The scientific approach involves experimentation, data 
collection, statistical modelling and analysis, and draw- 
ing inferences and conclusions which will support deci- 
sions about developing, testing and using software. The 
existence of probability seems inevitable here. It is 
necessary to quantify the uncertainty in terms of proba- 
bilities of various events occurring. 


Based on information or data concerning software
development, testing, previous failures, the usage
environment, and any other observables, we would like
to estimate (with confidence) the probability that a
particular piece of software fails during a given time
interval.
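
As one hedged illustration of such an estimate, consider the
simplest case: the software has run failure-free for a total
time t under operational usage, and failures are assumed to
follow a Poisson process with a constant rate. That assumption
is introduced here for the example only. The one-sided upper
confidence bound on the failure rate is then -ln(alpha)/t, and a
bound on the probability of failure in a given interval follows
from it. The function names are hypothetical.

```python
import math


def failure_rate_upper_bound(failure_free_time, confidence=0.95):
    """Upper confidence bound on a constant failure rate.

    Assumes P(no failure in t) = exp(-rate * t). The bound is the
    largest rate for which observing no failure in failure_free_time
    still has probability at least alpha = 1 - confidence.
    """
    if failure_free_time <= 0:
        raise ValueError("failure_free_time must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    alpha = 1.0 - confidence
    return -math.log(alpha) / failure_free_time


def failure_probability_upper_bound(failure_free_time, interval,
                                    confidence=0.95):
    """Bound on the probability of at least one failure in `interval`."""
    rate = failure_rate_upper_bound(failure_free_time, confidence)
    return 1.0 - math.exp(-rate * interval)


def required_failure_free_time(target_rate, confidence=0.95):
    """Failure-free time needed to support `target_rate` at `confidence`."""
    if target_rate <= 0:
        raise ValueError("target_rate must be positive")
    return -math.log(1.0 - confidence) / target_rate


if __name__ == "__main__":
    # Hypothetical figures: 1000 failure-free hours, 10-hour mission.
    print(failure_rate_upper_bound(1000.0))
    print(failure_probability_upper_bound(1000.0, 10.0))
    print(required_failure_free_time(1e-9, confidence=0.99))
```

Under these assumptions, supporting a failure rate of 1e-9 per
hour at 99% confidence would take roughly 4.6e9 failure-free
hours, which shows how quickly testing alone becomes
impractical as the reliability target rises.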

Reliability growth models attempt to estimate current
reliability and predict future reliability growth for a
given piece of software. These models base their
estimates and predictions only on past failure times of the
given piece of software. IBM’s Clean Room used
reliability growth models successfully. At the May 1990
Meeting of the IEEE Subcommittee on Software
Reliability Engineering, successes were also reported by
AT&T, HP and Cray Research. Unfortunately, the
reliability growth modelling approach is limited in many
ways: The models treat the software as a black box and
are only valid for random batch (memoryless) testing or
usage. The distribution of usage must be well known.
The models do not make use of additional data or
information which comes out during testing or usage. The
approach does not give useful estimates for extremely
high levels of reliability (e.g., avionics software and
other safety-related systems).
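
To make concrete what basing estimates only on past failure
times means, the sketch below fits one simple parametric growth
model to a list of failure times. The choice of model is an
assumption made for illustration: a power-law (Duane-type)
process with mean function M(t) = a * t**b, fitted with the
standard maximum-likelihood formulas for a test truncated at a
fixed time. The failure times are invented for the example and
the function names are hypothetical.

```python
import math


def fit_power_law(failure_times, total_time):
    """Maximum-likelihood fit of M(t) = a * t**b on (0, total_time]."""
    n = len(failure_times)
    if n == 0:
        raise ValueError("at least one failure time is required")
    if any(t <= 0 or t > total_time for t in failure_times):
        raise ValueError("failure times must lie in (0, total_time]")
    log_sum = sum(math.log(total_time / t) for t in failure_times)
    if log_sum == 0:
        raise ValueError("all failures at total_time; fit is undefined")
    b = n / log_sum            # b < 1 indicates reliability growth
    a = n / total_time ** b    # forces M(total_time) == n
    return a, b


def failure_rate(a, b, t):
    """Failure rate (derivative of the mean function) at time t."""
    return a * b * t ** (b - 1.0)


def expected_failures(a, b, start, end):
    """Expected number of failures between `start` and `end`."""
    return a * (end ** b - start ** b)


if __name__ == "__main__":
    # Invented failure times in hours, observed over 1000 hours.
    times = [5.0, 12.0, 30.0, 70.0, 150.0, 320.0, 640.0]
    a, b = fit_power_law(times, total_time=1000.0)
    print("current failure rate:", failure_rate(a, b, 1000.0))
    print("expected failures in next 500 h:",
          expected_failures(a, b, 1000.0, 1500.0))
```

Nothing in the fit uses the program's structure, the test
strategy, or the nature of the bugs found, which is the
black-box limitation described above.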

There are many factors which contribute to the
reliability of a piece of software. Case studies such as those
sponsored by NASA Goddard’s Software Engineering
Laboratory explore the effect of various factors on
software quality. Factors of interest include different
development scenarios, different testing strategies,
characteristics of programmers, and others. It can be
shown that software quality correlates with various
known factors, but calculating reliabilities from these
factors seems difficult if not impossible. One very
important category of information which should have
significant value in predicting reliability of a piece of
software is the programmer’s personal subjective
estimate of its reliability, especially after he has seen and
done a post mortem on the first few bugs discovered.

Current practice is often based on engineering
judgement. For example, commercial avionics software must
be produced following guidelines presented in
DO-178A, "Software Considerations in Airborne Systems
and Equipment Certification," prepared by Special
Committee 152 of the RTCA and currently under revision by
Special Committee 167. If appropriate documentation
supports compliance, the FAA certifies the software.
The actual software is never examined as part of the
certification. A major challenge facing the discipline of
Software Reliability Engineering involves justifying this
type of approach (also contained in various Military
Standards) in some objective, scientific sense.

To summarize: (i) For certain classes of software
projects, quantitative reliability estimation and prediction is
possible (and is done) for individual programs.
(ii) Through general case studies it is possible to identify
factors affecting reliability and thus get a qualitative
sense of what constitutes good software development
practice. (iii) For many critical software systems
requiring high reliability, the approach to reliability is very
subjective.

It is clear that a quantitative, objective approach to 
software reliability should be applied to more software 
projects. This means going beyond the current practice 
of software reliability growth modelling. The key seems 
to be: It is necessary to use available data much more 
efficiently (and imaginatively). There are two 
categories of data sources: Additional data can be col- 
lected (and used) specific to any particular piece of 
software whose reliability is being assessed. More 
importantly, there is data from similar and related pieces 
of existing software; I don’t think we know how to make 
effective use of this data. 

The goal is better quantitative understanding (and 
exploitation of that knowledge) of many software 
phenomena: behavior of real-time control systems, intri- 
cacies of fault-tolerant systems, efficacy of testing, 
identification of usage distributions, etc. All this 
knowledge is related to classes of software. (It is neces- 
sary to understand more than single software systems 
individually, one at a time.) Software metrics must be a 
key feature in this general quantitative understanding, 
because the similarity between pieces of software must 
be measured in order to define classes of software. 

To progress it is necessary to acquire data. An ideal (but 
expensive) source is controlled experimentation. For 
example, NASA Langley continues to sponsor experi- 
ments where replicated software is written. A better 
understanding of replicated batch-processing software 
has emerged from such experiments. Current
experiments should improve understanding of replicated
real-time control software. A second general source of data
is real software projects. A prime example is the data
collected and published by Musa; his data stimulated a
flurry of activity in reliability growth modelling. Such
experimentation and data collection are crucial.
Experimenting and collecting useful data across general classes
of software projects is a tremendous challenge.


The Software Reliability Gap: An Opportunity 
John D. Musa, AT&T Bell Labs

We are in the middle of both a problem and an oppor- 
tunity. I like to call it the “software reliability gap” 
because the needs of software customers have outrun the 
current practice of software engineering. You can’t tell 
whether they have outrun the technology, because there
is much technology that hasn’t been refined and applied. 

The core of the problem is that intense international 
competition has made unidimensional needs obsolete. If 
we only needed to add reliability to software products, 
we would have many tools and methodologies to help 
us. The problem is that other customer requirements, 
such as level of cost and delivery date, would not be 
met. Customers have multidimensional needs that are
interdependent and hence must be set and met more pre- 
cisely than ever before. The precision required can only 
increase in the future. 

Thus measurement is inevitable. Models are also
inevitable; we need to know the factors that influence product
attributes and how much each of them does, so that the
software development process can be controlled to yield
the desired objectives for the attributes. In short,
competition is creating a technological vacuum or gap.

The principal quality attributes that customers cite as 
being significant are reliability, cost, and delivery date.
Software reliability engineering is the last to develop of 
the three technologies supporting the measurement and 
modeling of these attributes. It is the keystone that 
makes quantitative software quality engineering possi- 
ble. Since quantitative hardware quality engineering 
already exists, the development of software reliability 
engineering also makes quantitative system quality 
engineering possible. 

Thus there is an enormous and rare opportunity to fill a 
widening gap, which makes this an exciting and chal- 
lenging time. 

What must software reliability engineering do to meet 
the challenge? In my opinion, several general things: 

(1) We need to induce a variety of projects to try it.
This is already happening, but greater variety 
would be useful. Care must be taken that it be 
applied correctly. 

(2) The experience on these projects must be
recorded, critiqued by others knowledgeable in
the field (to guard against misinformed
applications), and published.

(3) Published experience should be organized and 
digested, so it can be more easily taught to
practitioners and future practitioners.

(4) Problems that are blocking further progress and 
opportunities for new areas of application need to 
be identified, and they should be addressed by 
researchers. 

These activities clearly offer major possibilities for prac- 
titioners, researchers, and educators. People who 
acquire and use software play an important role in clari- 
fying the needs of the customer that are at the core of 
the driving forces acting on software reliability 
engineering. 

Can I say anything more specific? I would like to close 
by entering brainstorming mode and throwing out some 
thoughts for you to discuss: 

(1) We need research to tie software reliability more
strongly to the earlier part of the development pro- 
cess. Part of this effort involves determining how 
fault density is affected by product and process 
variables. 

(2) Little has been done to fulfill the promise of 
software reliability engineering for evaluating 
software engineering methodologies and tools. 
We need to help people do this. 

(3) We need data on human and computer resource 
usage in test, so that resource usage parameters 
can be determined. 

(4) The AIAA software reliability engineering guide- 
lines effort, which includes development of a 
handbook, looks promising. Because of the
diversity of contributors involved, it will be important
to devote much effort to interaction between and 
integration of their views. We don’t want a cata- 
log. 

(5) We need to strongly support our newsletter and 
our conference through personal participation in 
exchanging practical experience and research 
results. We need to keep the exchange flowing all 
year through our working committees. 

(6) We need software tools (with as many generic 
elements as possible) to record as large a propor- 
tion of failures as possible automatically, particu- 
larly in the field but also in test. We need to 
integrate this system with manually-reported 
failure systems, but consider implementing the
manual reporting online rather than on paper.

(7) The Software Engineering Institute has a metho- 
dology for assessing the quality level of software 
development processes. It does not currently
directly include a software reliability engineering 
program among its assessment criteria. It should, 
and we should discuss with them how to add it. 

I hope you will not only discuss these ideas here, but 
chew on them later as well. I hope you will add to this 
necessarily partial list of opportunities for action. I hope 
you will then seize some of them that appeal to you, and 
return as significant contributors next year or the year 
after. 


Software Reliability Engineering 
from Japanese Perspective 

Mits Ohba, IBM Corporation
"The wave comes from the East." 

Both computer technology and quality control
methods were invented and matured in the US, and they
were brought into Japan later. Japan has since caught
up quickly and become competitive in both areas. In
particular, Japan is viewed as the leader in the area of
quality control and quality management.

"Technology transfer begins when it is imported."

If we carefully review the processes by which Japan has 
caught up and gone further, we can find some similar 
patterns of technology development. The processes gen- 
erally begin at the importing phase where technology is 
investigated and evaluated. Then there is the deploy- 
ment phase, the migration phase, and finally, the Japani- 
zation phase. 

"How does it go through?" 

The deployment phase is the phase where the imported
technology is widely used and the know-how associated
with it is accumulated. The migration phase is the
phase where components of the technology are adjusted 
for the target environment(s). The Japanization phase is 
the phase where something additional and unique to 
Japan is added to the technology. 

"How has Japanese software engineering evolved?" 

Software engineering is a case in point. It was
introduced into Japan in 1977, two years after the first
issue of the IEEE Transactions on Software Engineering
appeared. Two years were spent on the importing phase
followed by two years of deployment. The migration
phase began in 1982 and lasted six years. The
Japanization phase began in 1988. An example of the
Japanization phase is what has become known as the
"Software Factory" concept.

"Software reliability research is not an exception." 

As a domain of research, software reliability engineer- 
ing is not an exception to the Japanese process. The ear- 
lier work done in the US by Musa, Goel and Okumoto 
drew the attention of Japanese reliability researchers as 
their new field of study. 

"What have Japanese researchers done in this field?" 

To date they have: 1) evaluated the basic models pro- 
posed by the American researchers by applying them to 
real project data, 2) modified the models in order to fit 
the data, 3) developed new models by examining the 
implications of the data and the assumptions of the basic
models, and 4) addressed the new research issues of 
models to be resolved. 

"Software factory did not need theories." 

On the other hand, software reliability engineering as a 
practice has evolved differently. It was begun as a 
branch of software quality control practices in order to 
determine whether a product developed by a vendor was 
acceptable. The logistic curve model and the Gompertz 
curve model were widely used in the industry and 
became de facto standard models for software factories.
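
For readers unfamiliar with these two curves, the sketch below
writes each in one common parameterization and evaluates it. The
parameter names (k for the saturation level, plus shape
parameters) and all numeric values are assumptions chosen for
illustration, not values from any project mentioned here.

```python
import math


def logistic_curve(t, k, m, a):
    """Cumulative errors detected by time t: k / (1 + m * exp(-a * t)).

    k is the saturation level; m > 0 and a > 0 set the starting
    point and the speed of growth.
    """
    return k / (1.0 + m * math.exp(-a * t))


def gompertz_curve(t, k, a, b):
    """Cumulative errors detected by time t: k * a ** (b ** t).

    k is the saturation level; 0 < a < 1 sets the starting fraction
    and 0 < b < 1 sets how quickly the curve approaches k.
    """
    return k * a ** (b ** t)


if __name__ == "__main__":
    # Illustrative parameters only: both curves start near 10 errors
    # and level off at k = 100.
    for week in range(0, 41, 5):
        print(week,
              round(logistic_curve(week, k=100.0, m=9.0, a=0.5), 2),
              round(gompertz_curve(week, k=100.0, a=0.1, b=0.8), 2))
```

Both are S-shaped curves fitted to cumulative error counts. They
describe the shape of the data without stating how failures
arise.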

"Technology transfer is really the problem." 

The implementation of the theory which has been 
developed by Japanese researchers is very slow. This is 
because the old models, with which the practitioners are 
familiar, are still sufficient for their needs. They will not 
change as long as the old practices work or until they 
recognize the advantages of the new theory. This is
similar to the way people once believed that the stars
were rotating.

"How can we convince the people that the earth 
rotates?" 

The most serious issue of software reliability
engineering as a practice in Japan is the education of the people.
It is similar to teaching them that the earth rotates, not the
stars. The models are not crystal balls. Prediction is
made based on a set of assumptions. If the assumptions
are not valid, a model based on them becomes
nonsense. The Gompertz curve fits most practical
project data because of its flexibility. But no one can
explain what the model really means.

"Why do we believe that the earth is rotating?" 

The most serious issue as a domain of research is to
explain the relationship between test cases and
reliability growth using reasonable models, which is also
similar to explaining why the earth seems to be
rotating. What software reliability growth tells us is a
characterization of the state of the software under
evaluation. It does not tell us how we can improve testing.
Obviously, time is not the real factor for improving
software reliability during the test phase.

"Can measurements and data be standardized?" 

A serious issue for both practitioners and researchers is
to establish standard ways of measuring software
reliability in practice. The models are based on a set of
assumptions. The models should be categorized based
on 1) what they can predict (e.g., MTTF, number of
errors), 2) what type of data they need (e.g., time
between failures, number of failures between
observations), 3) what assumptions they are based on, and 4)
what type of software they can analyze.
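
One way to picture such a categorization is as a small catalogue
keyed by the four criteria. The sketch below is illustrative
only: the record layout and field names are hypothetical, and
the two entries summarize commonly cited properties of the
Jelinski-Moranda and Goel-Okumoto models rather than any agreed
standard. The applicability field is left as a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """One catalogue entry, organized by the four criteria."""
    name: str
    predicts: tuple              # 1) what the model can predict
    data_needed: tuple           # 2) what type of data it needs
    assumptions: tuple           # 3) what it assumes
    applicable_software: str     # 4) what software it can analyze


CATALOGUE = (
    ModelProfile(
        name="Jelinski-Moranda",
        predicts=("number of remaining errors", "MTTF"),
        data_needed=("time between failures",),
        assumptions=(
            "finite number of faults",
            "each fault contributes equally to the failure rate",
            "faults are removed without introducing new ones",
        ),
        applicable_software="not specified (placeholder)",
    ),
    ModelProfile(
        name="Goel-Okumoto",
        predicts=("expected number of remaining errors", "failure rate"),
        data_needed=("time between failures",
                     "number of failures between observations"),
        assumptions=(
            "failures follow a nonhomogeneous Poisson process",
            "exponential mean function with a finite limit",
        ),
        applicable_software="not specified (placeholder)",
    ),
)


def models_accepting(data_type, catalogue=CATALOGUE):
    """Names of catalogued models that can use the given data type."""
    return [m.name for m in catalogue if data_type in m.data_needed]


if __name__ == "__main__":
    print(models_accepting("number of failures between observations"))
```

A practitioner who only records failure counts per observation
period could then filter the catalogue by the data actually
available before choosing a model.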

Back To The Future 

David Siefert, NCR

For the past 20 years, Software Engineering has
provided us with the capability for producing highly
reliable software. Software reliability is achieved, in part,
through the applied discipline of standardized practices,
methodologies, tools, and processes comprising the
"science" of Software Engineering. Today, dependence on
automation is greater than at any point in time in the
world’s history. Highly reliable products are expected
and assumed! The level of sophistication and
complexity of modern systems is intended to
be transparent to the end-user.

Applying Software Reliability Engineering Disciplines

Interestingly, the same practices, methodologies, etc.
that lead to the development of reliable software are also
the downfall! Why after all these years of "learning" is
the world still not applying and improving Software
Engineering disciplines etc.? Why do practitioners still
develop and maintain software based upon the
approaches used 20 years ago (lack of applied
discipline)? Why is it that researchers do not yet know
exactly what is the minimum that should be done to
develop reliable software? In support of consistently
producing reliable software, why after 20 years is there
still not a national database leading to consistent
project data collection, analysis, and ultimate determination
of practices, tools, and therefore required disciplines?
Shouldn’t a Software Engineering "Bluebook" exist?

Software Reliability Engineering is addressed in the fol- 
lowing two ways: 

(1) Technical Aspects of Software Reliability 

Technical software reliability consists of many
items. Determining reliability goals is one
activity. Reliability goals are typically referred to
in "technical" terms. These technical terms are
placed in product specifications. As it pertains to
Software Reliability Engineering, these terms or
goals are then tracked through product production
to the achievement of the goals. The environment
in which the software was produced has a
significant impact on the results. These specified
reliability goals are often determined through the
application of software reliability models. An
AIAA effort addressing Software Reliability is in
the process of providing guidance to industry on
which models to use and when. The computing
industry has yet to standardize these specific
models.

(2) End-User Software Reliability 

The second form of Software Reliability
Engineering is that of the end-user. The technical
specifications which include the software
reliability goals are expected to be mapped directly to the
end-user’s needs and expectations. Too often
there is no known methodology to take qualitative
and rather subjective unstructured feedback from
the end-user and transform it into quantifiable
and technically oriented input for use in
determining software reliability. Without this
methodology, software reliability difficulties will
remain. Meeting "specification" implies
meeting the end-user’s expectations. Meeting
specification is certainly one essential form of
measurement. Technical specifications are the
result of analysis of the end-user’s expectations,
not the other way around. Too often the technical
specification and the end-user’s expectations are
distinctly separate with no relationship between
each other. This results in minimal confidence
that the product will achieve its expectations.

Environmental issues are also important. To understand
software reliability, one must understand the
environment in which software resides. The environment for
software is systems! System components include other software
and hardware. Reliability should be computed or
budgeted in such a manner that reliability for each of the
components of the computer environment can be
determined, evaluated, measured, and tracked separately.
Reliability should also address a "total" system or
enterprise-wide solution. Typically, the end-user is
affected by using or experiencing the "total" system.
They typically have no ability to decipher the type of
defect or anomaly that has occurred. It is not clear that
they should. At any rate, Software Reliability
Engineering needs to address the "total" system as well as the
individual system components.
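
A minimal sketch of such a budget follows, assuming the simplest
textbook case in which the system fails if any component fails
and component failures are independent (a series system). These
assumptions, the component names, and the reliability figures
are hypothetical and chosen only for illustration.

```python
import math


def system_reliability(component_reliabilities):
    """Reliability of a series system: product of component values.

    `component_reliabilities` maps a component name to the probability
    that the component works over the mission of interest.
    """
    values = list(component_reliabilities.values())
    if any(not 0.0 <= r <= 1.0 for r in values):
        raise ValueError("reliabilities must lie between 0 and 1")
    return math.prod(values)


def equal_allocation(system_target, component_names):
    """Split a system target equally: each component gets target**(1/n)."""
    if not 0.0 < system_target <= 1.0:
        raise ValueError("system_target must lie in (0, 1]")
    if not component_names:
        raise ValueError("at least one component is required")
    share = system_target ** (1.0 / len(component_names))
    return {name: share for name in component_names}


if __name__ == "__main__":
    # Hypothetical budget tracked per component and for the total system.
    budget = {
        "application software": 0.999,
        "operating system": 0.9995,
        "hardware": 0.9999,
    }
    print("total system:", system_reliability(budget))
    print("equal split for 0.999:", equal_allocation(0.999, list(budget)))
```

Tracking the per-component figures separately, as the dictionary
does, keeps the "total" number the end-user experiences
traceable to the components that produce it.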

The Software Engineering community has reliability
models that lead to establishing reliability goals. "High
Confidence" goals (outputs) produced through the use of
these models are dependent upon past history. This
history should be retained in the form of a database.
Interestingly, no significant new software estimation
models have been revealed in the past 5 years. Without
the use of such databases as input to, and for the "tuning"
of, such models, the community is no closer to estimating
with high confidence levels the goals produced from the
models than it was 5 years ago. The
goals produced through the use of these models may not
be any better than the "guess" of you or me.

Besides past history, the technically specified software 
reliability goals are established and dependent on some 
basic items of information: 

- How are the end-user’s "needs" quantified?

- What is a software error, fault, and failure?

- What are the categories of software?

- How are Defect and Fault Density computed?

- What is a line-of-code or Function Point, and how is
it determined, by language?

- How is line-of-code or Function Point translated
between languages?

- How is Defect Density affected by software
production environmental issues?

- How is software to be tracked?
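
As an example of why these definitions matter, the sketch below
computes Defect Density under one common convention, defects per
thousand lines of code (KLOC). Both the convention and the
figures are assumptions for illustration; the same defect count
gives a different density as soon as the line-of-code definition
changes.

```python
def defect_density(defects, lines_of_code):
    """Defects per thousand lines of code (KLOC)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    if defects < 0:
        raise ValueError("defects must not be negative")
    return defects / (lines_of_code / 1000.0)


if __name__ == "__main__":
    # Hypothetical project with 42 recorded defects, sized two ways.
    physical_lines = 12000   # every non-blank source line counted
    logical_lines = 8000     # only executable statements counted
    print(defect_density(42, physical_lines))  # 3.5 defects per KLOC
    print(defect_density(42, logical_lines))   # 5.25 defects per KLOC
```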
Recommendations in Improving Software Reliability

• For Practitioners: 

(1) Practitioners must apply the disciplines of
Software Engineering. Techniques, methods,
tools, etc. associated with planning, design,
development, and testing (including verification and
validation) should be learned and rigidly applied.

(2) Each software production (or maintenance) organ- 
ization should develop and maintain a Software 
Engineering Environment Process (SEEP). This 
process should consist of all disciplines, tools, etc. 
actually used in the production of the software - 
including the measurement systems, of which 
software reliability is a part. 

(3) Practitioners should develop a database of past 
projects. The database should consist of such 
information as: the environment that produced the 
software, skill and types of personnel producing 
the software, Defect Densities, etc. This database 
is to be used as a basis for a Software Reliability 
Measurement Program (SRMP) and positioning 
for continuous improvement in Software 
Engineering. 

(4) A software reliability measurement program
(SRMP) should be put into place that consists of
measures that address both the scope of the
Software Engineering Environment Process and
specific product-related results. Measures should
consist of indicator measures, e.g., Test Coverage,
and estimator measures, i.e., models to estimate
reliability. The measurement program should consist
of a methodology that addresses the use of the
models, beginning with "how to" develop
reliability goals and ending with an approach to a
project post mortem. The previously mentioned
database would maintain all data. The database would
provide for root cause analysis and process
improvement of the Software Engineering
Environment Process.

• For Computer Scientist Researchers: 

(1) Researchers are to develop and maintain a 
national database (see above). The information 
contained in the database as previously noted 
should contain both product and environmental 
information. Researchers should evaluate the 
information in such a manner as to determine the
best practices, methods, required skills, etc. to
continuously improve software reliability.

(2) Researchers should provide standards on such
subjects as: language constructs, line-of-code
definitions, Function Points, etc.

(3) Researchers should determine minimum impacts 
as to how to conclude with deriving "high 
confidence" software reliability goals, etc. 
Models are to be evaluated and maintained. 

(4) Researchers should also determine education cur- 
ricula for software engineering enabling the con- 
tinuous achievement of high confidence reliable 
software. 

(5) Researchers should determine how to quantify 
results from evaluating user’s needs. These 
results are used as input into various
reliability tools, models, etc. as discussed earlier.

(6) Researchers should establish and maintain a "Blue 
Book for Software Engineering." 

Concluding Comments 

The world continues to embrace higher and higher levels 
of technology. Software is at the heart of the demand 
for complex features and functions which are packaged
to make the complexity transparent to the end-user. 
High confidence software reliability is in jeopardy. 
Software Engineering processes that consist of discip- 
lines, tools, methods, etc. are not being utilized con- 
sistently. The science of Software Engineering is not 
being practiced. 

A need exists to focus on the basics; in the simplest form 
of understanding software and Software Engineering. 
Data needs to drive decisions. Attaining highly reliable 
software - consistently - positioned through processes 
for the purpose of improvement is essential. Research- 
ers need to provide the "data driven" credibility in the 
baseline evaluations of software and software environ- 
ments (and processes). Researchers need to see that the 
appropriate Software Engineering disciplines are applied 
- consistently and appropriately, evaluating the results, 
and improving the disciplines and processes. 

The disciplines exist in the form of Software
Engineering to produce reliable software! The discipline and
formality required to achieve the results remain the
challenge! The solution is: "go BACK and apply the
discipline TO get to THE FUTURE ..."